[jira] [Commented] (SPARK-16965) Fix bound checking for SparseVector

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412965#comment-15412965
 ] 

Apache Spark commented on SPARK-16965:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14555

> Fix bound checking for SparseVector
> ---
>
> Key: SPARK-16965
> URL: https://issues.apache.org/jira/browse/SPARK-16965
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> There are several issues in the bound checking of SparseVector:
> 1. In Scala, negative index checking is missing and the bound checks are 
> scattered across several places; they should be consolidated in one place.
> 2. In Python, lower/upper bound checking of indices is missing.






[jira] [Assigned] (SPARK-16965) Fix bound checking for SparseVector

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16965:


Assignee: (was: Apache Spark)

> Fix bound checking for SparseVector
> ---
>
> Key: SPARK-16965
> URL: https://issues.apache.org/jira/browse/SPARK-16965
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> There are several issues in the bound checking of SparseVector:
> 1. In Scala, negative index checking is missing and the bound checks are 
> scattered across several places; they should be consolidated in one place.
> 2. In Python, lower/upper bound checking of indices is missing.






[jira] [Assigned] (SPARK-16965) Fix bound checking for SparseVector

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16965:


Assignee: Apache Spark

> Fix bound checking for SparseVector
> ---
>
> Key: SPARK-16965
> URL: https://issues.apache.org/jira/browse/SPARK-16965
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> There are several issues in the bound checking of SparseVector:
> 1. In Scala, negative index checking is missing and the bound checks are 
> scattered across several places; they should be consolidated in one place.
> 2. In Python, lower/upper bound checking of indices is missing.






[jira] [Updated] (SPARK-16965) Fix bound checking for SparseVector

2016-08-08 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-16965:
---
Component/s: PySpark
 MLlib

> Fix bound checking for SparseVector
> ---
>
> Key: SPARK-16965
> URL: https://issues.apache.org/jira/browse/SPARK-16965
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> There are several issues in the bound checking of SparseVector:
> 1. In Scala, negative index checking is missing and the bound checks are 
> scattered across several places; they should be consolidated in one place.
> 2. In Python, lower/upper bound checking of indices is missing.






[jira] [Resolved] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH

2016-08-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16887.
--
Resolution: Won't Fix

> Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
> 
>
> Key: SPARK-16887
> URL: https://issues.apache.org/jira/browse/SPARK-16887
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> When deploying Spark, it can be pretty convenient to put all jars (Spark jars, 
> Hadoop jars, and other libraries' jars) that we want to include in Spark's 
> classpath in the same directory, which may not be Spark's assembly directory. 
> So, I am proposing to also add SPARK_DIST_CLASSPATH to the LAUNCH_CLASSPATH.






[jira] [Created] (SPARK-16965) Fix bound checking for SparseVector

2016-08-08 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-16965:
--

 Summary: Fix bound checking for SparseVector
 Key: SPARK-16965
 URL: https://issues.apache.org/jira/browse/SPARK-16965
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Jeff Zhang
Priority: Minor


There are several issues in the bound checking of SparseVector:

1. In Scala, negative index checking is missing and the bound checks are 
scattered across several places; they should be consolidated in one place.
2. In Python, lower/upper bound checking of indices is missing.
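As an illustration of the consolidation asked for above, here is a minimal Scala sketch; the helper name validateIndices is hypothetical, not Spark's actual API.

{code}
// Hypothetical helper (name invented for this sketch): one place that rejects
// both negative indices and indices >= size, instead of scattering the checks.
def validateIndices(size: Int, indices: Array[Int]): Unit = {
  require(size >= 0, s"size must be non-negative, got $size")
  indices.foreach { i =>
    require(i >= 0 && i < size, s"index $i is out of bounds [0, $size)")
  }
}

validateIndices(4, Array(1, 3))      // passes
// validateIndices(4, Array(-1, 5))  // would throw IllegalArgumentException
{code}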






[jira] [Assigned] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16964:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove private[sql] and private[spark] from sql.execution package
> -
>
> Key: SPARK-16964
> URL: https://issues.apache.org/jira/browse/SPARK-16964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> The execution package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.
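For readers unfamiliar with Scala's qualified access modifiers, a minimal sketch of what private[sql] means; the class below is invented for illustration and is not one of Spark's.

{code}
package org.apache.spark.sql.execution

class ExamplePlan {
  // Visible only from code compiled under the org.apache.spark.sql package;
  // this is the restriction the ticket says gets in the way of inspecting
  // plans at runtime.
  private[sql] def internalState: String = "package-private detail"

  // Visible everywhere.
  def publicState: String = "public detail"
}
{code}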






[jira] [Commented] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412907#comment-15412907
 ] 

Apache Spark commented on SPARK-16964:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14554

> Remove private[sql] and private[spark] from sql.execution package
> -
>
> Key: SPARK-16964
> URL: https://issues.apache.org/jira/browse/SPARK-16964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The execution package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.






[jira] [Assigned] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16964:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove private[sql] and private[spark] from sql.execution package
> -
>
> Key: SPARK-16964
> URL: https://issues.apache.org/jira/browse/SPARK-16964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The execution package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.






[jira] [Updated] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package

2016-08-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16964:

Description: 
The execution package is meant to be internal, and as a result it does not make 
sense to mark things as private[sql] or private[spark]. It simply makes 
debugging harder when Spark developers need to inspect the plans at runtime.


> Remove private[sql] and private[spark] from sql.execution package
> -
>
> Key: SPARK-16964
> URL: https://issues.apache.org/jira/browse/SPARK-16964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The execution package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.






[jira] [Created] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package

2016-08-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16964:
---

 Summary: Remove private[sql] and private[spark] from sql.execution 
package
 Key: SPARK-16964
 URL: https://issues.apache.org/jira/browse/SPARK-16964
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412893#comment-15412893
 ] 

Dongjoon Hyun commented on SPARK-16955:
---

Hi, [~yhuai].
Could you review the PR?
The root cause was that `ResolveAggregateFunctions` removed the ordinal sort 
orders too early.
After improving the `if` condition to check that resolution is complete, the 
case works well.

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> 

[jira] [Comment Edited] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

2016-08-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412882#comment-15412882
 ] 

Nattavut Sutyanyong edited comment on SPARK-16951 at 8/9/16 3:54 AM:
-

The following output is tested on Spark master trunk built on August 5, 2016.

{noformat}
scala> Seq(1,2).toDF("c1").createOrReplaceTempView("t1")

scala> Seq(1).toDF("c2").createOrReplaceTempView("t2")

scala> sql("select t2.c2+1 as c3 from t1 left join t2 on 
t1.c1=t2.c2").createOrReplaceTempView("t3")

scala> sql("select * from t1").show
+---+
| c1|
+---+
|  1|
|  2|
+---+


scala> sql("select * from t2").show
+---+
| c2|
+---+
|  1|
+---+


scala> sql("select * from t3").show
++
|  c3|
++
|   2|
|null|
++
{noformat}

Case 1:
{noformat}
scala> sql("select * from t3 where c3 not in (select c2 from t2)").show
++
|  c3|
++
|   2|
|null|
++
{noformat}
The correct result is:
{noformat}
++
|  c3|
++
|   2|
++
{noformat}

Case 2:
{noformat}
scala> sql("select * from t1 where c1 not in (select c3 from t3)").show
+---+
| c1|
+---+
+---+
{noformat}

The answer is correct.

Case 3:
{noformat}
scala> sql("select * from t1 where c1 not in (select c2 from t2 where 
1=2)").show
+---+
| c1|
+---+
|  1|
|  2|
+---+
{noformat}

The correct result is:
{noformat}
+---+
| c1|
+---+
+---+
{noformat}




> Alternative implementation of NOT IN to Anti-join
> -
>
> Key: SPARK-16951
> URL: https://issues.apache.org/jira/browse/SPARK-16951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> The transformation currently used to process a {{NOT IN}} subquery rewrites it 
> to a form of Anti-join with a null-aware property in the Logical Plan, which is 
> then translated to an {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the {{OR}} predicate restricts 
> execution to a nested-loop join plan, which has a major performance 
> implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join, which addresses 
> the problem of {{EXISTS (..) OR ..}} type queries.






[jira] [Commented] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

2016-08-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412882#comment-15412882
 ] 

Nattavut Sutyanyong commented on SPARK-16951:
-

The following output is tested on Spark master trunk built on August 5, 2016.


scala> Seq(1,2).toDF("c1").createOrReplaceTempView("t1")

scala> Seq(1).toDF("c2").createOrReplaceTempView("t2")

scala> sql("select t2.c2+1 as c3 from t1 left join t2 on 
t1.c1=t2.c2").createOrReplaceTempView("t3")

scala> sql("select * from t1").show
+---+
| c1|
+---+
|  1|
|  2|
+---+


scala> sql("select * from t2").show
+---+
| c2|
+---+
|  1|
+---+


scala> sql("select * from t3").show
++
|  c3|
++
|   2|
|null|
++


Case 1:

scala> sql("select * from t3 where c3 not in (select c2 from t2)").show
++
|  c3|
++
|   2|
|null|
++

The correct result is:

++
|  c3|
++
|   2|
++


Case 2:

scala> sql("select * from t1 where c1 not in (select c3 from t3)").show
+---+
| c1|
+---+
+---+


The answer is correct.

Case 3:

scala> sql("select * from t1 where c1 not in (select c2 from t2 where 
1=2)").show
+---+
| c1|
+---+
|  1|
|  2|
+---+


The correct result is:

+---+
| c1|
+---+
+---+


> Alternative implementation of NOT IN to Anti-join
> -
>
> Key: SPARK-16951
> URL: https://issues.apache.org/jira/browse/SPARK-16951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> The transformation currently used to process a {{NOT IN}} subquery rewrites it 
> to a form of Anti-join with a null-aware property in the Logical Plan, which is 
> then translated to an {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the {{OR}} predicate restricts 
> execution to a nested-loop join plan, which has a major performance 
> implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join, which addresses 
> the problem of {{EXISTS (..) OR ..}} type queries.






[jira] [Commented] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

2016-08-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412878#comment-15412878
 ] 

Nattavut Sutyanyong commented on SPARK-16951:
-

The semantics of {{NOT IN}} are described in detail in "[Subqueries in Apache 
Spark 
2.0|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html]".
 Concisely,

"{{x NOT IN (subquery y)}} translates into: {{x <> y1 AND x <> y2 ... AND x <> 
yn}}"

When {{x}} and {{subquery y}} cannot produce a {{NULL}} value, {{NOT IN}} is 
equivalent to its {{NOT EXISTS}} counterpart. That is,

{{SELECT .. FROM X WHERE X.C1 NOT IN (SELECT Y.C2 FROM Y)}}

is equivalent to

{{SELECT .. FROM X WHERE NOT EXISTS (SELECT 1 FROM Y WHERE X.C1=Y.C2)}}

However, there are 3 edge cases we need to pay attention to.

Case 1. When {{X.C1}} is {{NULL}}, the row is removed from the result set.
Case 2. When the {{subquery Y}} can produce a {{NULL}} value in the output column 
{{Y.C2}}, the result is an empty set.
Case 3. When the {{subquery Y}} produces an empty set, the SQL language defines that 
the subquery will return a row of {{NULL}} value, hence this is like case 2, 
which returns an empty set.
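A minimal Scala model of the three-valued comparison behind cases 1 and 2 above, with Option[Int] standing in for a nullable column; this illustrates the SQL semantics and is not Spark code.

{code}
// SQL three-valued AND: FALSE dominates, otherwise any unknown makes the result unknown.
def and3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] = (a, b) match {
  case (Some(false), _) | (_, Some(false)) => Some(false)
  case (Some(true), Some(true))            => Some(true)
  case _                                   => None // unknown
}

// x <> y is unknown if either side is NULL.
def notEq(x: Option[Int], y: Option[Int]): Option[Boolean] =
  for (a <- x; b <- y) yield a != b

// x NOT IN (y1, y2, ...) == (x <> y1) AND (x <> y2) AND ...
def notIn(x: Option[Int], ys: Seq[Option[Int]]): Option[Boolean] =
  ys.map(notEq(x, _)).foldLeft(Option(true))(and3)

notIn(Some(3), Seq(Some(1)))        // Some(true): the row survives the filter
notIn(None,    Seq(Some(1)))        // None (case 1): the row is filtered out
notIn(Some(3), Seq(Some(1), None))  // None (case 2): the row is filtered out
{code}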

> Alternative implementation of NOT IN to Anti-join
> -
>
> Key: SPARK-16951
> URL: https://issues.apache.org/jira/browse/SPARK-16951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> The transformation currently used to process a {{NOT IN}} subquery rewrites it 
> to a form of Anti-join with a null-aware property in the Logical Plan, which is 
> then translated to an {{OR}} predicate joining the parent side and the 
> subquery side of the {{NOT IN}}. As a result, the {{OR}} predicate restricts 
> execution to a nested-loop join plan, which has a major performance 
> implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join, which addresses 
> the problem of {{EXISTS (..) OR ..}} type queries.






[jira] [Updated] (SPARK-12920) Honor "spark.ui.retainedStages" to reduce mem-pressure

2016-08-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12920:
-
Summary: Honor "spark.ui.retainedStages" to reduce mem-pressure  (was: Fix 
high CPU usage in spark thrift server with concurrent users)

> Honor "spark.ui.retainedStages" to reduce mem-pressure
> --
>
> Key: SPARK-12920
> URL: https://issues.apache.org/jira/browse/SPARK-12920
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12920.profiler.png, 
> SPARK-12920.profiler_job_progress_listner.png
>
>
> - Configured with the fair-share scheduler.
> - 4-5 users submitting/running jobs concurrently via the Spark Thrift Server
> - Spark Thrift Server spikes to 1600+% CPU and stays there for a long time
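For reference, a minimal Scala sketch of the configuration the new summary refers to; the values below are examples, not recommendations.

{code}
import org.apache.spark.SparkConf

// Caps how many completed stages/jobs the UI listener keeps in driver memory,
// which is the memory-pressure lever the updated summary points at.
val conf = new SparkConf()
  .set("spark.ui.retainedStages", "500")
  .set("spark.ui.retainedJobs", "500")
{code}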






[jira] [Commented] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-08-08 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412849#comment-15412849
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
-

Thank you, @hvanhovell, for merging my PR.

> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
> Fix For: 2.1.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Assigned] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16963:


Assignee: Apache Spark

> Change Source API so that sources do not need to keep unbounded state
> -
>
> Key: SPARK-16963
> URL: https://issues.apache.org/jira/browse/SPARK-16963
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Frederick Reiss
>Assignee: Apache Spark
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.






[jira] [Assigned] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16963:


Assignee: (was: Apache Spark)

> Change Source API so that sources do not need to keep unbounded state
> -
>
> Key: SPARK-16963
> URL: https://issues.apache.org/jira/browse/SPARK-16963
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Frederick Reiss
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.






[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412840#comment-15412840
 ] 

Apache Spark commented on SPARK-16963:
--

User 'frreiss' has created a pull request for this issue:
https://github.com/apache/spark/pull/14553

> Change Source API so that sources do not need to keep unbounded state
> -
>
> Key: SPARK-16963
> URL: https://issues.apache.org/jira/browse/SPARK-16963
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Frederick Reiss
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.






[jira] [Updated] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16610:

Assignee: Hyukjin Kwon

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Hyukjin Kwon
> Fix For: 2.0.1, 2.1.0
>
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used 
> to set the codec; its value is also set to orc.compress (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the writer 
> options, we should not use the default value of "compression" (snappy) as the 
> codec. Instead, we should respect the value of {{orc.compress}}.
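A short Scala sketch of the two writer configurations in question; the paths and data are placeholders.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("orc-compress").getOrCreate()
import spark.implicits._
val df = Seq(1, 2, 3).toDF("c1")

// Spark-level writer option: always picks the codec.
df.write.option("compression", "zlib").orc("/tmp/orc_with_compression")

// ORC conf only: with this fix it is respected instead of being overridden by
// the default of "compression" (snappy) when "compression" is not set.
df.write.option("orc.compress", "ZLIB").orc("/tmp/orc_with_orc_compress")
{code}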






[jira] [Resolved] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16610.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14518
[https://github.com/apache/spark/pull/14518]

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
> Fix For: 2.0.1, 2.1.0
>
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used 
> to set the codec; its value is also set to orc.compress (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the writer 
> options, we should not use the default value of "compression" (snappy) as the 
> codec. Instead, we should respect the value of {{orc.compress}}.






[jira] [Created] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

2016-08-08 Thread Frederick Reiss (JIRA)
Frederick Reiss created SPARK-16963:
---

 Summary: Change Source API so that sources do not need to keep 
unbounded state
 Key: SPARK-16963
 URL: https://issues.apache.org/jira/browse/SPARK-16963
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Frederick Reiss


The version of the Source API in Spark 2.0.0 defines a single getBatch() method 
for fetching records from the source, with the following Scaladoc comments 
defining the semantics:

{noformat}
/**
 * Returns the data that is between the offsets (`start`, `end`]. When `start` 
is `None` then
 * the batch should begin with the first available record. This method must 
always return the
 * same data for a particular `start` and `end` pair.
 */
def getBatch(start: Option[Offset], end: Offset): DataFrame
{noformat}
These semantics mean that a Source must retain all past history for the stream 
that it backs. Further, a Source is also required to retain this data across 
restarts of the process where the Source is instantiated, even when the Source 
is restarted on a different machine.
These restrictions make it difficult to implement the Source API, as any 
implementation requires potentially unbounded amounts of distributed storage.
See the mailing list thread at 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
 for more information.
This JIRA will cover augmenting the Source API with an additional callback that 
will allow Structured Streaming scheduler to notify the source when it is safe 
to discard buffered data.
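A purely hypothetical sketch of the kind of callback being proposed; the trait and method names below are invented for illustration, and the types are local placeholders rather than Spark's classes.

{code}
case class Offset(value: Long)   // placeholder standing in for the real Offset type
trait Batch                      // placeholder standing in for DataFrame

trait BoundedStateSource {
  // Same contract as today's Source.getBatch: replayable between (start, end].
  def getBatch(start: Option[Offset], end: Offset): Batch

  // Hypothetical callback: the scheduler promises it will never again request
  // offsets at or before `end`, so the source may discard data buffered up to there.
  def commit(end: Offset): Unit
}
{code}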






[jira] [Assigned] (SPARK-16952) [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home even if spark.executor.uri is set

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16952:


Assignee: (was: Apache Spark)

> [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home 
> even if spark.executor.uri is set
> ---
>
> Key: SPARK-16952
> URL: https://issues.apache.org/jira/browse/SPARK-16952
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Charles Allen
>Priority: Minor
>
> In the Mesos coarse grained scheduler, setting `spark.executor.uri` bypasses 
> the code path which requires `spark.mesos.executor.home` since the uri 
> effectively provides the executor home.
> But 
> `org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend#createCommand`
>  requires `spark.mesos.executor.home` to be set regardless.
> Our workaround is to set `spark.mesos.executor.home=/dev/null` when using an 
> executor uri.
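The workaround from the description, expressed as SparkConf settings in Scala; the executor URI below is a placeholder.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.uri", "hdfs:///dist/spark-2.0.0-bin-hadoop2.7.tgz") // placeholder URI
  .set("spark.mesos.executor.home", "/dev/null") // dummy value; only needed to satisfy the check
{code}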






[jira] [Assigned] (SPARK-16952) [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home even if spark.executor.uri is set

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16952:


Assignee: Apache Spark

> [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home 
> even if spark.executor.uri is set
> ---
>
> Key: SPARK-16952
> URL: https://issues.apache.org/jira/browse/SPARK-16952
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Charles Allen
>Assignee: Apache Spark
>Priority: Minor
>
> In the Mesos coarse grained scheduler, setting `spark.executor.uri` bypasses 
> the code path which requires `spark.mesos.executor.home` since the uri 
> effectively provides the executor home.
> But 
> `org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend#createCommand`
>  requires `spark.mesos.executor.home` to be set regardless.
> Our workaround is to set `spark.mesos.executor.home=/dev/null` when using an 
> executor uri.






[jira] [Commented] (SPARK-16952) [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home even if spark.executor.uri is set

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412825#comment-15412825
 ] 

Apache Spark commented on SPARK-16952:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14552

> [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home 
> even if spark.executor.uri is set
> ---
>
> Key: SPARK-16952
> URL: https://issues.apache.org/jira/browse/SPARK-16952
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Charles Allen
>Priority: Minor
>
> In the Mesos coarse grained scheduler, setting `spark.executor.uri` bypasses 
> the code path which requires `spark.mesos.executor.home` since the uri 
> effectively provides the executor home.
> But 
> `org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend#createCommand`
>  requires `spark.mesos.executor.home` to be set regardless.
> Our workaround is to set `spark.mesos.executor.home=/dev/null` when using an 
> executor uri.






[jira] [Created] (SPARK-16962) Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in SPARC/Solaris

2016-08-08 Thread Suman Somasundar (JIRA)
Suman Somasundar created SPARK-16962:


 Summary: Unsafe accesses (Platform.getLong()) not supported on 
unaligned boundaries in SPARC/Solaris
 Key: SPARK-16962
 URL: https://issues.apache.org/jira/browse/SPARK-16962
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: SPARC/Solaris
Reporter: Suman Somasundar


Unaligned accesses are not supported on the SPARC architecture. Because of this, 
Spark applications fail by dumping core on SPARC machines whenever an unaligned 
access happens.






[jira] [Issue Comment Deleted] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly

2016-08-08 Thread Nicholas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas updated SPARK-16961:
-
Comment: was deleted

(was: I am submitting a PR)

> Utils.randomizeInPlace does not shuffle arrays uniformly
> 
>
> Key: SPARK-16961
> URL: https://issues.apache.org/jira/browse/SPARK-16961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nicholas
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
> elements on an input array, never shuffles elements to their starting 
> position. That is, every permutation of the input array is equally likely to 
> be returned, except for any permutation in which any element is in the same 
> position where it began. These permutations are never output.
> This is because line 827 of Utils.scala should be
> {{val j = rand.nextInt(i + 1)}}
> instead of
> {{val j = rand.nextInt( i )}}






[jira] [Assigned] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16961:


Assignee: (was: Apache Spark)

> Utils.randomizeInPlace does not shuffle arrays uniformly
> 
>
> Key: SPARK-16961
> URL: https://issues.apache.org/jira/browse/SPARK-16961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nicholas
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
> elements on an input array, never shuffles elements to their starting 
> position. That is, every permutation of the input array is equally likely to 
> be returned, except for any permutation in which any element is in the same 
> position where it began. These permutations are never output.
> This is because line 827 of Utils.scala should be
> {{val j = rand.nextInt(i + 1)}}
> instead of
> {{val j = rand.nextInt( i )}}






[jira] [Commented] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412783#comment-15412783
 ] 

Apache Spark commented on SPARK-16961:
--

User 'nicklavers' has created a pull request for this issue:
https://github.com/apache/spark/pull/14551

> Utils.randomizeInPlace does not shuffle arrays uniformly
> 
>
> Key: SPARK-16961
> URL: https://issues.apache.org/jira/browse/SPARK-16961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nicholas
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
> elements on an input array, never shuffles elements to their starting 
> position. That is, every permutation of the input array is equally likely to 
> be returned, except for any permutation in which any element is in the same 
> position where it began. These permutations are never output.
> This is because line 827 of Utils.scala should be
> {{val j = rand.nextInt(i + 1)}}
> instead of
> {{val j = rand.nextInt( i )}}






[jira] [Assigned] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16961:


Assignee: Apache Spark

> Utils.randomizeInPlace does not shuffle arrays uniformly
> 
>
> Key: SPARK-16961
> URL: https://issues.apache.org/jira/browse/SPARK-16961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nicholas
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
> elements on an input array, never shuffles elements to their starting 
> position. That is, every permutation of the input array is equally likely to 
> be returned, except for any permutation in which any element is in the same 
> position where it began. These permutations are never output.
> This is because line 827 of Utils.scala should be
> {{val j = rand.nextInt(i + 1)}}
> instead of
> {{val j = rand.nextInt( i )}}






[jira] [Resolved] (SPARK-16563) Repeat calling Spark SQL thrift server fetchResults return empty for ExecuteStatement operation

2016-08-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16563.
-
   Resolution: Fixed
 Assignee: Gu Huiqin Alice
Fix Version/s: 2.1.0
   2.0.1

> Repeat calling Spark SQL thrift server fetchResults return empty for 
> ExecuteStatement operation
> ---
>
> Key: SPARK-16563
> URL: https://issues.apache.org/jira/browse/SPARK-16563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Gu Huiqin Alice
>Assignee: Gu Huiqin Alice
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> Repeatedly calling FetchResults(... orientation=FetchOrientation.FETCH_FIRST ..) 
> of the Spark SQL Thrift service returns an empty set after calling 
> ExecuteStatement of TCLIService. 
> The bug exists in the *function public RowSet getNextRowSet(FetchOrientation 
> orientation, long maxRows)*
> https://github.com/apache/spark/blob/02c8072eea72425e89256347e1f373a3e76e6eba/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java#L332
> The iterator for getting results can be consumed only once, so repeatedly calling 
> FetchResults with the FETCH_FIRST parameter returns an empty result.
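A minimal plain-Scala sketch of why a single-use iterator breaks FETCH_FIRST; this is an illustration, not the Thrift server code.

{code}
val rows = Seq("row1", "row2", "row3").iterator

val firstFetch  = rows.take(10).toList   // List(row1, row2, row3)
val secondFetch = rows.take(10).toList   // List() -- the iterator is already exhausted,
                                         // so a repeated FETCH_FIRST sees an empty result
{code}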






[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-08 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412727#comment-15412727
 ] 

Brian commented on SPARK-6235:
--

How is it possible that Spark 2.0 comes out and this bug isn't solved?  A quick 
Google search for "Spark 2GB limit" or "Spark Integer.MAX_VALUE" shows that 
this is a very real problem that affects lots of users.  From the outside 
looking in, it seems like the Spark developers don't have an interest in 
solving this bug since it's been around for years at this point (including the 
jiras this consolidated ticket replaced).  Can you provide some sort of an 
update?  Maybe if you don't plan on fixing this issue, you can close the ticket 
or mark it as won't fix.  At least that way we'd have some insight into your 
plans. Thanks!

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Resolved] (SPARK-16898) Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn

2016-08-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16898.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14494
[https://github.com/apache/spark/pull/14494]

> Adds argument type information for typed logical plan like MapElements, 
> TypedFilter, and AppendColumn
> -
>
> Key: SPARK-16898
> URL: https://issues.apache.org/jira/browse/SPARK-16898
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Minor
> Fix For: 2.1.0
>
>
> Typed logical plans like MapElements, TypedFilter, and AppendColumn contain a 
> closure field: {{func: (T) => Boolean}}. For example, class TypedFilter's 
> signature is:
> {code}
> case class TypedFilter(
> func: AnyRef,
> deserializer: Expression,
> child: LogicalPlan) extends UnaryNode
> {code} 
> From the above class signature, we cannot easily find:
> 1. What is the input argument's type of the closure {{func}}? How do we know 
> which apply method to pick if there are multiple overloaded apply methods?
> 2. What is the input argument's schema? 
> With this info, it is easier for us to define some custom optimizer rule to 
> translate these typed logical plan to more efficient implementation, like the 
> closure optimization idea in SPARK-14083.
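A hypothetical illustration of the extra information the ticket asks to carry next to the untyped closure; the field names and types below are invented placeholders, not Spark's classes.

{code}
final case class SchemaSketch(fieldNames: Seq[String])   // stand-in for a real schema type

final case class TypedFilterSketch(
    func: AnyRef,                  // the user's (T) => Boolean, erased to AnyRef
    argumentClass: Class[_],       // answers question 1: which overloaded apply method applies
    argumentSchema: SchemaSketch,  // answers question 2: the schema of the deserialized input
    childDescription: String)      // placeholder for the child logical plan
{code}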






[jira] [Updated] (SPARK-16898) Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn

2016-08-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16898:

Assignee: Sean Zhong

> Adds argument type information for typed logical plan like MapElements, 
> TypedFilter, and AppendColumn
> -
>
> Key: SPARK-16898
> URL: https://issues.apache.org/jira/browse/SPARK-16898
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.1.0
>
>
> Typed logical plans like MapElements, TypedFilter, and AppendColumn contain a 
> closure field: {{func: (T) => Boolean}}. For example, class TypedFilter's 
> signature is:
> {code}
> case class TypedFilter(
> func: AnyRef,
> deserializer: Expression,
> child: LogicalPlan) extends UnaryNode
> {code} 
> From the above class signature, we cannot easily find:
> 1. What is the input argument's type of the closure {{func}}? How do we know 
> which apply method to pick if there are multiple overloaded apply methods?
> 2. What is the input argument's schema? 
> With this info, it is easier for us to define some custom optimizer rule to 
> translate these typed logical plan to more efficient implementation, like the 
> closure optimization idea in SPARK-14083.






[jira] [Commented] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly

2016-08-08 Thread Nicholas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412705#comment-15412705
 ] 

Nicholas commented on SPARK-16961:
--

I am submitting a PR

> Utils.randomizeInPlace does not shuffle arrays uniformly
> 
>
> Key: SPARK-16961
> URL: https://issues.apache.org/jira/browse/SPARK-16961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nicholas
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
> elements on an input array, never shuffles elements to their starting 
> position. That is, every permutation of the input array is equally likely to 
> be returned, except for any permutation in which any element is in the same 
> position where it began. These permutations are never output.
> This is because line 827 of Utils.scala should be
> {{val j = rand.nextInt(i + 1)}}
> instead of
> {{val j = rand.nextInt( i )}}






[jira] [Updated] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly

2016-08-08 Thread Nicholas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas updated SPARK-16961:
-
Description: 
The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
elements on an input array, never shuffles elements to their starting position. 
That is, every permutation of the input array is equally likely to be returned, 
except for any permutation in which any element is in the same position where 
it began. These permutations are never output.
This is because line 827 of Utils.scala should be
{{val j = rand.nextInt(i + 1)}}
instead of
{{val j = rand.nextInt( i )}}

  was:
The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
elements on an input array, never shuffles elements to their starting position. 
That is, every permutation of the input array is equally likely to be returned, 
except for any permutation in which any element is in the same position where 
it began. These permutations are never output.
This is because line 827 of Utils.scala should be
{{val j = rand.nextInt(i + 1)}}
instead of
{{val j = rand.nextInt(i)}}


> Utils.randomizeInPlace does not shuffle arrays uniformly
> 
>
> Key: SPARK-16961
> URL: https://issues.apache.org/jira/browse/SPARK-16961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nicholas
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
> elements on an input array, never shuffles elements to their starting 
> position. That is, every permutation of the input array is equally likely to 
> be returned, except for any permutation in which any element is in the same 
> position where it began. These permutations are never output.
> This is because line 827 of Utils.scala should be
> {{val j = rand.nextInt(i + 1)}}
> instead of
> {{val j = rand.nextInt( i )}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly

2016-08-08 Thread Nicholas (JIRA)
Nicholas created SPARK-16961:


 Summary: Utils.randomizeInPlace does not shuffle arrays uniformly
 Key: SPARK-16961
 URL: https://issues.apache.org/jira/browse/SPARK-16961
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Nicholas
Priority: Minor


The Utils.randomizeInPlace method, which is meant to uniformly shuffle the 
elements on an input array, never shuffles elements to their starting position. 
That is, every permutation of the input array is equally likely to be returned, 
except for any permutation in which any element is in the same position where 
it began. These permutations are never output.
This is because line 827 of Utils.scala should be
{{val j = rand.nextInt(i + 1)}}
instead of
{{val j = rand.nextInt(i)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16960:


Assignee: (was: Apache Spark)

> Deprecate approxCountDistinct, toDegrees and toRadians according to 
> FunctionRegistry in Scala and Python
> 
>
> Key: SPARK-16960
> URL: https://issues.apache.org/jira/browse/SPARK-16960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also 
> missed while matching the names to the ones in {{FunctionRegistry}}. (please 
> see 
> [approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244],
>  
> [degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203]
>  and 
> [radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222]
>  in `FunctionRegistry`).
> I compared {{functions.scala}} with {{FunctionRegistry}} and it seems these are 
> all that are left. {{countDistinct}} and {{sumDistinct}} are not registered in 
> {{FunctionRegistry}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412695#comment-15412695
 ] 

Apache Spark commented on SPARK-16960:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14538

> Deprecate approxCountDistinct, toDegrees and toRadians according to 
> FunctionRegistry in Scala and Python
> 
>
> Key: SPARK-16960
> URL: https://issues.apache.org/jira/browse/SPARK-16960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also 
> missed while matching the names to the ones in {{FunctionRegistry}}. (please 
> see 
> [approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244],
>  
> [degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203]
>  and 
> [radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222]
>  in `FunctionRegistry`).
> I compared {{functions.scala}} with {{FunctionRegistry}} and it seems these are 
> all that are left. {{countDistinct}} and {{sumDistinct}} are not registered in 
> {{FunctionRegistry}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python

2016-08-08 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-16960:


 Summary: Deprecate approxCountDistinct, toDegrees and toRadians 
according to FunctionRegistry in Scala and Python
 Key: SPARK-16960
 URL: https://issues.apache.org/jira/browse/SPARK-16960
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also 
missed while matching the names to the ones in {{FunctionRegistry}}. (please 
see 
[approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244],
 
[degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203]
 and 
[radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222]
 in `FunctionRegistry`).

I compared {{functions.scala}} with {{FunctionRegistry}} and it seems these are all 
that are left. {{countDistinct}} and {{sumDistinct}} are not registered in 
{{FunctionRegistry}}.
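
A minimal sketch of what the deprecation could look like in Scala, assuming the 
registry-style names {{approx_count_distinct}}, {{degrees}} and {{radians}} are 
added to {{functions.scala}} as this issue proposes (the deprecation messages and 
version below are illustrative assumptions):

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{approx_count_distinct, degrees, radians}

// Old names kept as deprecated aliases of the FunctionRegistry names.
object DeprecatedAliases {
  @deprecated("Use approx_count_distinct", "2.1.0")
  def approxCountDistinct(e: Column): Column = approx_count_distinct(e)

  @deprecated("Use degrees", "2.1.0")
  def toDegrees(e: Column): Column = degrees(e)

  @deprecated("Use radians", "2.1.0")
  def toRadians(e: Column): Column = radians(e)
}
{code}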




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16960:


Assignee: Apache Spark

> Deprecate approxCountDistinct, toDegrees and toRadians according to 
> FunctionRegistry in Scala and Python
> 
>
> Key: SPARK-16960
> URL: https://issues.apache.org/jira/browse/SPARK-16960
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also 
> missed while matching the names to the ones in {{FunctionRegistry}}. (please 
> see 
> [approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244],
>  
> [degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203]
>  and 
> [radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222]
>  in `FunctionRegistry`).
> I compared {{functions.scala}} with {{FunctionRegistry}} and it seems these are 
> all that are left. {{countDistinct}} and {{sumDistinct}} are not registered in 
> {{FunctionRegistry}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412658#comment-15412658
 ] 

Apache Spark commented on SPARK-16959:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14550

> Table Comment in the CatalogTable returned from HiveMetastore is Always Empty
> -
>
> Key: SPARK-16959
> URL: https://issues.apache.org/jira/browse/SPARK-16959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> The `comment` in the `CatalogTable` returned from Hive is always empty. We store 
> it in the table properties when creating a table. However, when we retrieve the 
> table metadata from the Hive metastore, we do not rebuild it, so the `comment` 
> comes back empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16959:


Assignee: (was: Apache Spark)

> Table Comment in the CatalogTable returned from HiveMetastore is Always Empty
> -
>
> Key: SPARK-16959
> URL: https://issues.apache.org/jira/browse/SPARK-16959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> The `comment` in the `CatalogTable` returned from Hive is always empty. We store 
> it in the table properties when creating a table. However, when we retrieve the 
> table metadata from the Hive metastore, we do not rebuild it, so the `comment` 
> comes back empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16959:


Assignee: Apache Spark

> Table Comment in the CatalogTable returned from HiveMetastore is Always Empty
> -
>
> Key: SPARK-16959
> URL: https://issues.apache.org/jira/browse/SPARK-16959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> The `comment` in the `CatalogTable` returned from Hive is always empty. We store 
> it in the table properties when creating a table. However, when we retrieve the 
> table metadata from the Hive metastore, we do not rebuild it, so the `comment` 
> comes back empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty

2016-08-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16959:
---

 Summary: Table Comment in the CatalogTable returned from 
HiveMetastore is Always Empty
 Key: SPARK-16959
 URL: https://issues.apache.org/jira/browse/SPARK-16959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


The `comment` in the `CatalogTable` returned from Hive is always empty. We store it 
in the table properties when creating a table. However, when we retrieve the table 
metadata from the Hive metastore, we do not rebuild it, so the `comment` comes back 
empty.
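
A minimal Scala sketch of the kind of fix the description suggests: when converting 
the Hive table back into a `CatalogTable`, restore the comment from the stored table 
properties. The property key {{"comment"}} and the helper name are assumptions, not 
the actual patch:

{code}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Copy the comment back out of the table properties, if present.
def restoreComment(table: CatalogTable): CatalogTable =
  table.properties.get("comment") match {
    case Some(c) => table.copy(comment = Some(c))
    case None    => table
  }
{code}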



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16749) Clean-up OffsetWindowFrame

2016-08-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16749.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14376
[https://github.com/apache/spark/pull/14376]

> Clean-up OffsetWindowFrame
> --
>
> Key: SPARK-16749
> URL: https://issues.apache.org/jira/browse/SPARK-16749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.1.0
>
>
> The code in OffsetWindowFrame could be made a bit more streamlined and faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12909) Spark on Mesos accessing Secured HDFS w/Kerberos

2016-08-08 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412624#comment-15412624
 ] 

Michael Gummelt commented on SPARK-12909:
-

DC/OS Spark has this functionality, and we'll be upstreaming it to Apache Spark 
soon.

> Spark on Mesos accessing Secured HDFS w/Kerberos
> 
>
> Key: SPARK-12909
> URL: https://issues.apache.org/jira/browse/SPARK-12909
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Greg Senia
>
> Ability for Spark on Mesos to use a Kerberized HDFS FileSystem for data. It 
> seems like this is not possible based on email chains and forum articles. If 
> that is true, how hard would it be to get this implemented? I'm willing to 
> try to help.
> https://community.hortonworks.com/questions/5415/spark-on-yarn-vs-mesos.html
> https://www.mail-archive.com/user@spark.apache.org/msg31326.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16779) Fix unnecessary use of postfix operations

2016-08-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16779.

   Resolution: Fixed
 Assignee: holdenk
Fix Version/s: 2.1.0

> Fix unnecessary use of postfix operations
> -
>
> Key: SPARK-16779
> URL: https://issues.apache.org/jira/browse/SPARK-16779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16958) Reuse subqueries within single query

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16958:


Assignee: Apache Spark  (was: Davies Liu)

> Reuse subqueries within single query
> 
>
> Key: SPARK-16958
> URL: https://issues.apache.org/jira/browse/SPARK-16958
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> The same subquery can appear multiple times within a single query; we could 
> reuse its result instead of running it multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16958) Reuse subqueries within single query

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16958:


Assignee: Davies Liu  (was: Apache Spark)

> Reuse subqueries within single query
> 
>
> Key: SPARK-16958
> URL: https://issues.apache.org/jira/browse/SPARK-16958
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The same subquery can appear multiple times within a single query; we could 
> reuse its result instead of running it multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16958) Reuse subqueries within single query

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412578#comment-15412578
 ] 

Apache Spark commented on SPARK-16958:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14548

> Reuse subqueries within single query
> 
>
> Key: SPARK-16958
> URL: https://issues.apache.org/jira/browse/SPARK-16958
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The same subquery can appear multiple times within a single query; we could 
> reuse its result instead of running it multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16958) Reuse subqueries within single query

2016-08-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16958:
--

 Summary: Reuse subqueries within single query
 Key: SPARK-16958
 URL: https://issues.apache.org/jira/browse/SPARK-16958
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


The same subquery can appear multiple times within a single query; we could reuse 
its result instead of running it multiple times.
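
A hypothetical example (table and column names are made up) in which the same scalar 
subquery appears twice and could be computed once and reused:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("subquery-reuse-example")
  .master("local[*]")
  .getOrCreate()

// Assumes `orders` and `sales` are already registered as temp views.
spark.sql("""
  SELECT *
  FROM orders o
  WHERE o.amount   > (SELECT avg(amount) FROM sales)
     OR o.discount > (SELECT avg(amount) FROM sales) * 0.1
""").explain(true)
{code}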



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2016-08-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-11150:
---
Assignee: (was: Davies Liu)

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0
>Reporter: Younes
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are based on parquets.
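
For clarity, a spelled-out version of the two query shapes above (table and column 
names are made up):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  .master("local[*]")
  .getOrCreate()

// Static filter on the partition column: only the partcol=1 partition is scanned.
spark.sql("SELECT id FROM tab WHERE partcol = 1").explain()

// Same filter expressed through a join: without dynamic partition pruning,
// every partition of `tab` is scanned even though `dim` restricts partcol to 1.
spark.sql(
  """SELECT t.id
    |FROM tab t JOIN dim d ON d.partcol = t.partcol
    |WHERE d.partcol = 1""".stripMargin).explain()
{code}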



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2016-08-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-11150:
---
Target Version/s:   (was: 2.1.0)

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0
>Reporter: Younes
>Assignee: Davies Liu
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are based on parquets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16718) gbm-style treeboost

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412565#comment-15412565
 ] 

Apache Spark commented on SPARK-16718:
--

User 'vlad17' has created a pull request for this issue:
https://github.com/apache/spark/pull/14547

> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Vladimir Feinberg
>
> As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
> impurity, tree leaves in GBTs can have loss-optimal predictions for their 
> partition of the data.
> The commit should include evidence of an accuracy improvement.
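
For intuition, a small Scala illustration of what "loss-optimal predictions" means 
for a leaf's residuals: the constant minimizing L2 loss is their mean and the 
constant minimizing L1 loss is their median (the logistic case needs a Newton-style 
step and is omitted). This is background math, not the proposed implementation:

{code}
// Constant prediction minimizing squared (L2) loss over a leaf's residuals.
def leafPredictionL2(residuals: Seq[Double]): Double =
  residuals.sum / residuals.size

// Constant prediction minimizing absolute (L1) loss: the median.
def leafPredictionL1(residuals: Seq[Double]): Double = {
  val sorted = residuals.sorted
  val n = sorted.size
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}
{code}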



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16718) gbm-style treeboost

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16718:


Assignee: Vladimir Feinberg  (was: Apache Spark)

> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Vladimir Feinberg
>
> As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
> impurity, tree leaves in GBTs can have loss-optimal predictions for their 
> partition of the data.
> The commit should include evidence of an accuracy improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16718) gbm-style treeboost

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16718:


Assignee: Apache Spark  (was: Vladimir Feinberg)

> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Apache Spark
>
> As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
> impurity, tree leaves in GBTs can have loss-optimal predictions for their 
> partition of the data.
> The commit should include evidence of an accuracy improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16955:


Assignee: (was: Apache Spark)

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Assigned] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16955:


Assignee: Apache Spark

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412551#comment-15412551
 ] 

Apache Spark commented on SPARK-16955:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14546

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>  

[jira] [Resolved] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml

2016-08-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12326.
---
Resolution: Done

> Move GBT implementation from spark.mllib to spark.ml
> 
>
> Key: SPARK-12326
> URL: https://issues.apache.org/jira/browse/SPARK-12326
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> Several improvements can be made to gradient boosted trees, but are not 
> possible without moving the GBT implementation to spark.ml (e.g. 
> rawPrediction column, feature importance). This Jira is for moving the 
> current GBT implementation to spark.ml, which will have roughly the following 
> steps:
> 1. Copy the implementation to spark.ml and change spark.ml classes to use 
> that implementation. Current tests will ensure that the implementations learn 
> exactly the same models. 
> 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, 
> InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since 
> eventually all tree implementations will reside in spark.ml, the helper 
> classes should as well.
> 3. Remove the spark.mllib implementation, and make the spark.mllib APIs 
> wrappers around the spark.ml implementation. The spark.ml tests will again 
> ensure that we do not change any behavior.
> 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
> verify model equivalence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12383) Move unit tests for GBT from spark.mllib to spark.ml

2016-08-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12383.
---
Resolution: Duplicate

> Move unit tests for GBT from spark.mllib to spark.ml
> 
>
> Key: SPARK-12383
> URL: https://issues.apache.org/jira/browse/SPARK-12383
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> After the GBT implementation is moved from MLlib to ML, we should move the 
> unit tests to ML as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml

2016-08-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12326:
--
Priority: Minor  (was: Major)

> Move GBT implementation from spark.mllib to spark.ml
> 
>
> Key: SPARK-12326
> URL: https://issues.apache.org/jira/browse/SPARK-12326
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> Several improvements can be made to gradient boosted trees, but are not 
> possible without moving the GBT implementation to spark.ml (e.g. 
> rawPrediction column, feature importance). This Jira is for moving the 
> current GBT implementation to spark.ml, which will have roughly the following 
> steps:
> 1. Copy the implementation to spark.ml and change spark.ml classes to use 
> that implementation. Current tests will ensure that the implementations learn 
> exactly the same models. 
> 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, 
> InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since 
> eventually all tree implementations will reside in spark.ml, the helper 
> classes should as well.
> 3. Remove the spark.mllib implementation, and make the spark.mllib APIs 
> wrappers around the spark.ml implementation. The spark.ml tests will again 
> ensure that we do not change any behavior.
> 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
> verify model equivalence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12381.
---
Resolution: Duplicate

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412476#comment-15412476
 ] 

Seth Hendrickson commented on SPARK-12381:
--

I haven't looked at this in a while. Please feel free to take it over.

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412475#comment-15412475
 ] 

Dongjoon Hyun commented on SPARK-16955:
---

`ResolveAggregateFunctions` seems to have a bug that drops the ordinals. I'll make 
a PR after some testing.

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
> 

[jira] [Commented] (SPARK-11150) Dynamic partition pruning

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412468#comment-15412468
 ] 

Apache Spark commented on SPARK-11150:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14545

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0
>Reporter: Younes
>Assignee: Davies Liu
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are based on parquets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.

2016-08-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16957:
--
Priority: Trivial  (was: Major)

> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> Just like R's gbm, we should be using weighted split points rather than the 
> actual continuous binned feature values. For instance, in a dataset 
> containing binary features (that are fed in as continuous ones), our splits 
> are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some 
> smoothness qualities, this is asymptotically bad compared to GBM's approach. 
> The split point should be a weighted split point of the two values of the 
> "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, 
> the above split should be at {{0.75}}.
> Example:
> {code}
> +++-+-+
> |feature0|feature1|label|count|
> +++-+-+
> | 0.0| 0.0|  0.0|   23|
> | 1.0| 0.0|  0.0|2|
> | 0.0| 0.0|  1.0|2|
> | 0.0| 1.0|  0.0|7|
> | 1.0| 0.0|  1.0|   23|
> | 0.0| 1.0|  1.0|   18|
> | 1.0| 1.0|  1.0|7|
> | 1.0| 1.0|  0.0|   18|
> +++-+-+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>If (feature 1 <= 0.0)
> Predict: -0.56
>Else (feature 1 > 0.0)
> Predict: 0.29333
>   Else (feature 0 > 0.0)
>If (feature 1 <= 0.0)
> Predict: 0.56
>Else (feature 1 > 0.0)
> Predict: -0.29333
> {code}
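
One reading of the example above, as a Scala sketch: weight each boundary value by 
the count on the other side, so the threshold lands closer to the rarer value 
(whether this is exactly gbm's formula is an assumption here, not confirmed by the 
description):

{code}
// With 30 samples at x = 0 and 10 samples at x = 1 this returns 0.75,
// matching the example in the description.
def weightedSplit(leftValue: Double, leftCount: Long,
                  rightValue: Double, rightCount: Long): Double =
  (leftCount * rightValue + rightCount * leftValue) / (leftCount + rightCount).toDouble

// weightedSplit(0.0, 30, 1.0, 10) == 0.75
{code}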



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking

2016-08-08 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412461#comment-15412461
 ] 

Michael Gummelt commented on SPARK-11638:
-

[~radekg]

> The only advantage we had was using the same configuration inside of the 
> docker container.

You mean you want to run the spark driver in a docker container?  Which 
configuration did you have to change?  I can look more into this, but I need a 
clear "It's easier/better to do X in bridge mode than in host mode".

> So with the HTTP API, Spark would still require the heavy libmesos in order 
> to work with Mesos?

No.  The HTTP API will remove the libmesos dependency, which is nice.  It's not 
an urgent priority though. 

> Run Spark on Mesos with bridge networking
> -
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different from the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, it's a bridge networking 
> mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to 
> Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such container results in 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on are prepared by the Spark 
> Master and handed over to the executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> all above ports are by default {{0}} (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark 
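
As a rough sketch of the configuration proposed above (the {{*.advertisedPort}} keys come from the attached patches and do not exist in stock Spark; all port numbers below are placeholders):

{code}
import org.apache.spark.SparkConf

object BridgeNetworkingConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // Bind ports inside the container (placeholders).
      .set("spark.driver.port", "7777")
      .set("spark.fileserver.port", "6677")
      .set("spark.broadcast.port", "6688")
      .set("spark.replClassServer.port", "23456")
      // Host ports mapped by Mesos/Marathon (placeholders); these are the values
      // the patched services would advertise to executors.
      .set("spark.driver.advertisedPort", "31001")
      .set("spark.fileserver.advertisedPort", "31002")
      .set("spark.broadcast.advertisedPort", "31003")
      .set("spark.replClassServer.advertisedPort", "31004")

    conf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }
  }
}
{code}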

[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412459#comment-15412459
 ] 

Vladimir Feinberg commented on SPARK-12381:
---

Yeah, that'd be a good idea.

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.

2016-08-08 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16957:
--
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-14045)

> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> Just like R's gbm, we should be using weighted split points rather than the 
> actual continuous binned feature values. For instance, in a dataset 
> containing binary features (that are fed in as continuous ones), our splits 
> are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some 
> smoothness qualities, this is asymptotically bad compared to GBM's approach. 
> The split point should be the weighted midpoint of the two values of the 
> "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, 
> the above split should be at {{0.75}}.
> Example:
> {code}
> +++-+-+
> |feature0|feature1|label|count|
> +++-+-+
> | 0.0| 0.0|  0.0|   23|
> | 1.0| 0.0|  0.0|2|
> | 0.0| 0.0|  1.0|2|
> | 0.0| 1.0|  0.0|7|
> | 1.0| 0.0|  1.0|   23|
> | 0.0| 1.0|  1.0|   18|
> | 1.0| 1.0|  1.0|7|
> | 1.0| 1.0|  0.0|   18|
> +++-+-+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>If (feature 1 <= 0.0)
> Predict: -0.56
>Else (feature 1 > 0.0)
> Predict: 0.29333
>   Else (feature 0 > 0.0)
>If (feature 1 <= 0.0)
> Predict: 0.56
>Else (feature 1 > 0.0)
> Predict: -0.29333
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16957) Use weighted midpoints for split values.

2016-08-08 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16957:
-

 Summary: Use weighted midpoints for split values.
 Key: SPARK-16957
 URL: https://issues.apache.org/jira/browse/SPARK-16957
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Vladimir Feinberg


Just like R's gbm, we should be using weighted split points rather than the 
actual continuous binned feature values. For instance, in a dataset containing 
binary features (that are fed in as continuous ones), our splits are selected 
as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness 
qualities, this is asymptotically bad compared to GBM's approach. The split 
point should be the weighted midpoint of the two values of the "innermost" 
feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split 
should be at {{0.75}}.

Example:
{code}
+++-+-+
|feature0|feature1|label|count|
+++-+-+
| 0.0| 0.0|  0.0|   23|
| 1.0| 0.0|  0.0|2|
| 0.0| 0.0|  1.0|2|
| 0.0| 1.0|  0.0|7|
| 1.0| 0.0|  1.0|   23|
| 0.0| 1.0|  1.0|   18|
| 1.0| 1.0|  1.0|7|
| 1.0| 1.0|  0.0|   18|
+++-+-+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
Predict: -0.56
   Else (feature 1 > 0.0)
Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
Predict: 0.56
   Else (feature 1 > 0.0)
Predict: -0.29333
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412451#comment-15412451
 ] 

Sean Owen commented on SPARK-12381:
---

Are you also basically subsuming 
https://issues.apache.org/jira/browse/SPARK-12383 ? I'd like to mark all of 
these as duplicates then, since they don't have activity.

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12381:
--
Comment: was deleted

(was: I will be out of the office until Monday 15 August. I will not have 
regular access to email during this time, but will respond upon my return. For 
any urgent enquiries, please contact Frederick Kruger at 
frederick.kru...@quantium.com.au, or call the office on +61 2 9292 6400.
)

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2016-08-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-11150:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.1.0
   Issue Type: New Feature  (was: Bug)

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0
>Reporter: Younes
>Assignee: Davies Liu
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are stored as Parquet files.
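
A self-contained sketch of the scenario (placeholder tables and path; assumes a 2.x {{SparkSession}}):

{code}
import org.apache.spark.sql.SparkSession

object PartitionPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-pruning-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder partitioned Parquet table standing in for `tab`.
    Seq((1, "a"), (2, "b"), (3, "c")).toDF("partcol", "v")
      .write.partitionBy("partcol").mode("overwrite").parquet("/tmp/tab")
    spark.read.parquet("/tmp/tab").createOrReplaceTempView("tab")
    Seq((1, "dim-a")).toDF("partcol", "name").createOrReplaceTempView("dim")

    // Pruned: only the partcol=1 directory is scanned.
    spark.sql("SELECT * FROM tab WHERE partcol = 1").explain()

    // Not pruned (the issue reported here): every partition of `tab` is scanned,
    // even though the join fixes dim.partcol to 1.
    spark.sql(
      "SELECT * FROM tab JOIN dim ON dim.partcol = tab.partcol WHERE dim.partcol = 1"
    ).explain()

    spark.stop()
  }
}
{code}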



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11150) Dynamic partition pruning

2016-08-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-11150:
--

Assignee: Davies Liu

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Younes
>Assignee: Davies Liu
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are stored as Parquet files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Matthew Carle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412439#comment-15412439
 ] 

Matthew Carle commented on SPARK-12381:
---

I will be out of the office until Monday 15 August. I will not have regular 
access to email during this time, but will respond upon my return. For any 
urgent enquiries, please contact Frederick Kruger at 
frederick.kru...@quantium.com.au, or call the office on +61 2 9292 6400.


> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412438#comment-15412438
 ] 

Vladimir Feinberg commented on SPARK-12381:
---

[~sethah] Just so we don't clash, I think these two JIRAs are overlapping: 
SPARK-16728

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors

2016-08-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-16953.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14541
[https://github.com/apache/spark/pull/14541]

> Make requestTotalExecutors public to be consistent with 
> requestExecutors/killExecutors
> --
>
> Key: SPARK-16953
> URL: https://issues.apache.org/jira/browse/SPARK-16953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16956) Make ApplicationState.MAX_NUM_RETRY configurable

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16956:


Assignee: Josh Rosen  (was: Apache Spark)

> Make ApplicationState.MAX_NUM_RETRY configurable
> 
>
> Key: SPARK-16956
> URL: https://issues.apache.org/jira/browse/SPARK-16956
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The {{ApplicationState.MAX_NUM_RETRY}} setting, which controls the maximum 
> number of back-to-back executor failures that the standalone cluster manager 
> will tolerate before removing a faulty application, is currently a hardcoded 
> constant (10), but there are use-cases for making it configurable (TBD in my 
> PR). We should add a new configuration key to let users customize this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16956) Make ApplicationState.MAX_NUM_RETRY configurable

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16956:


Assignee: Apache Spark  (was: Josh Rosen)

> Make ApplicationState.MAX_NUM_RETRY configurable
> 
>
> Key: SPARK-16956
> URL: https://issues.apache.org/jira/browse/SPARK-16956
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The {{ApplicationState.MAX_NUM_RETRY}} setting, which controls the maximum 
> number of back-to-back executor failures that the standalone cluster manager 
> will tolerate before removing a faulty application, is currently a hardcoded 
> constant (10), but there are use-cases for making it configurable (TBD in my 
> PR). We should add a new configuration key to let users customize this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16956) Make ApplicationState.MAX_NUM_RETRY configurable

2016-08-08 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-16956:
--

 Summary: Make ApplicationState.MAX_NUM_RETRY configurable
 Key: SPARK-16956
 URL: https://issues.apache.org/jira/browse/SPARK-16956
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Reporter: Josh Rosen
Assignee: Josh Rosen


The {{ApplicationState.MAX_NUM_RETRY}} setting, which controls the maximum 
number of back-to-back executor failures that the standalone cluster manager 
will tolerate before removing a faulty application, is currently a hardcoded 
constant (10), but there are use-cases for making it configurable (TBD in my 
PR). We should add a new configuration key to let users customize this.
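
A sketch of what reading such a setting could look like (the key name below is purely hypothetical, since the actual name is TBD in the PR):

{code}
import org.apache.spark.SparkConf

object MaxRetriesSketch {
  // Hypothetical key name for illustration only.
  val MaxRetriesKey = "spark.deploy.maxExecutorRetries"

  // Falls back to the currently hardcoded default of 10 when the key is unset.
  def maxNumRetry(conf: SparkConf): Int = conf.getInt(MaxRetriesKey, 10)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false).set(MaxRetriesKey, "25")
    println(maxNumRetry(conf))                                 // 25
    println(maxNumRetry(new SparkConf(loadDefaults = false)))  // 10
  }
}
{code}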



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking

2016-08-08 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412293#comment-15412293
 ] 

Radoslaw Gruchalski commented on SPARK-11638:
-

[~mandoskippy] Yes, my lack of knowledge of the API shows there. I just read 
http://events.linuxfoundation.org/sites/events/files/slides/Mesos_HTTP_API.pdf. 
Considering that the Mesos scheduler was changed to take advantage of it, that 
might be the case. Older versions of Mesos would still require the native 
library.

> Run Spark on Mesos with bridge networking
> -
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different from the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, it's a bridge networking 
> mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to 
> Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such container results in 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on are prepared by the Spark 
> Master and handed over to the executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> all above ports are by default {{0}} (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and 
> {{akka.remote.net.tcp.bind-port}} settings are a must. Spark does not compile 
> with Akka 

[jira] [Updated] (SPARK-16552) Store the Inferred Schemas into External Catalog Tables when Creating Tables

2016-08-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16552:
-
Labels: release_notes releasenotes  (was: )

> Store the Inferred Schemas into External Catalog Tables when Creating Tables
> 
>
> Key: SPARK-16552
> URL: https://issues.apache.org/jira/browse/SPARK-16552
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>  Labels: release_notes, releasenotes
> Fix For: 2.1.0
>
>
> Currently, in Spark SQL, the initial creation of schema can be classified 
> into two groups. It is applicable to both Hive tables and Data Source tables:
> Group A. Users specify the schema. 
> Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema 
> of the SELECT clause. For example,
> {noformat}
> CREATE TABLE tab STORED AS TEXTFILE
> AS SELECT * from input
> {noformat}
> Case 2 CREATE TABLE: users explicitly specify the schema. For example,
> {noformat}
> CREATE TABLE jsonTable (_1 string, _2 string)
> USING org.apache.spark.sql.json
> {noformat}
> Group B. Spark SQL infers the schema at runtime.
> Case 3 CREATE TABLE. Users do not specify the schema but the path to the file 
> location. For example,
> {noformat}
> CREATE TABLE jsonTable 
> USING org.apache.spark.sql.json
> OPTIONS (path '${tempDir.getCanonicalPath}')
> {noformat}
> Now, Spark SQL does not store the inferred schema in the external catalog for 
> the cases in Group B. When users refresh the metadata cache or access the 
> table for the first time after (re-)starting Spark, Spark SQL will infer the 
> schema and store it in the metadata cache to improve the performance 
> of subsequent metadata requests. However, the runtime schema inference could 
> cause undesirable schema changes after each reboot of Spark.
> It is desirable to store the inferred schema in the external catalog when 
> creating the table. When users intend to refresh the schema, they issue 
> `REFRESH TABLE`. Spark SQL will infer the schema again based on the 
> previously specified table location and update/refresh the schema in the 
> external catalog and metadata cache. 
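
A short sketch of the intended workflow for a Group B table (placeholder path; assumes a 2.x {{SparkSession}}):

{code}
import org.apache.spark.sql.SparkSession

object InferredSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inferred-schema-sketch").getOrCreate()

    // Case 3: no schema given, only a path; the schema is inferred from the files
    // and, with this change, persisted in the external catalog at create time.
    spark.sql(
      """CREATE TABLE jsonTable
        |USING org.apache.spark.sql.json
        |OPTIONS (path '/tmp/json-data')""".stripMargin)

    // When the files at that location change shape, explicitly re-infer and
    // update the stored schema instead of relying on a restart.
    spark.sql("REFRESH TABLE jsonTable")

    spark.stop()
  }
}
{code}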



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16953:


Assignee: Apache Spark  (was: Tathagata Das)

> Make requestTotalExecutors public to be consistent with 
> requestExecutors/killExecutors
> --
>
> Key: SPARK-16953
> URL: https://issues.apache.org/jira/browse/SPARK-16953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tathagata Das
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412290#comment-15412290
 ] 

Apache Spark commented on SPARK-16953:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/14541

> Make requestTotalExecutors public to be consistent with 
> requestExecutors/killExecutors
> --
>
> Key: SPARK-16953
> URL: https://issues.apache.org/jira/browse/SPARK-16953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16953:


Assignee: Tathagata Das  (was: Apache Spark)

> Make requestTotalExecutors public to be consistent with 
> requestExecutors/killExecutors
> --
>
> Key: SPARK-16953
> URL: https://issues.apache.org/jira/browse/SPARK-16953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412265#comment-15412265
 ] 

Dongjoon Hyun commented on SPARK-16955:
---

Sure! Thank you, [~yhuai]. I'll take a look at this.

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412263#comment-15412263
 ] 

Yin Huai commented on SPARK-16955:
--

[~dongjoon] Will you have time to take a look?

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Created] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-08 Thread Yin Huai (JIRA)
Yin Huai created SPARK-16955:


 Summary: Using ordinals in ORDER BY causes an analysis error when 
the query has a GROUP BY clause using ordinals
 Key: SPARK-16955
 URL: https://issues.apache.org/jira/browse/SPARK-16955
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yin Huai


The following queries work
{code}
select a from (select 1 as a) tmp order by 1
select a, count(*) from (select 1 as a) tmp group by 1
select a, count(*) from (select 1 as a) tmp group by 1 order by a
{code}

However, the following query does not
{code}
select a, count(*) from (select 1 as a) tmp group by 1 order by 1
{code}
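
A minimal sketch to reproduce this from a 2.0.0 application or spark-shell (assumes a default {{SparkSession}}); running the last statement raises the analysis error shown below:

{code}
import org.apache.spark.sql.SparkSession

object OrdinalOrderBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ordinal-order-by-sketch").getOrCreate()

    // These three work.
    spark.sql("select a from (select 1 as a) tmp order by 1").show()
    spark.sql("select a, count(*) from (select 1 as a) tmp group by 1").show()
    spark.sql("select a, count(*) from (select 1 as a) tmp group by 1 order by a").show()

    // GROUP BY ordinal combined with ORDER BY ordinal fails.
    spark.sql("select a, count(*) from (select 1 as a) tmp group by 1 order by 1").show()

    spark.stop()
  }
}
{code}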

{code}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
Group by position: '1' exceeds the size of the select list '0'. on unresolved 
object, tree:
Aggregate [1]
+- SubqueryAlias tmp
   +- Project [1 AS a#82]
  +- OneRowRelation$

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 

[jira] [Commented] (SPARK-14666) Using DISTINCT on a UDF (like CONCAT) is not supported

2016-08-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412250#comment-15412250
 ] 

Dongjoon Hyun commented on SPARK-14666:
---

Great!

> Using DISTINCT on a UDF (like CONCAT) is not supported
> --
>
> Key: SPARK-14666
> URL: https://issues.apache.org/jira/browse/SPARK-14666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Dominic Ricard
>Priority: Minor
> Fix For: 2.0.0
>
>
> The following query fails with:
> {noformat}
> Java::JavaSql::SQLException: org.apache.spark.sql.AnalysisException: cannot 
> resolve 'column_1' given input columns: [_c0]; line # pos ##
> {noformat}
> Query:
> {noformat}
> select
>   distinct concat(column_1, ' : ', column_2)
> from
>   table
> order by
>   concat(column_1, ' : ', column_2);
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14666) Using DISTINCT on a UDF (like CONCAT) is not supported

2016-08-08 Thread Dominic Ricard (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Ricard resolved SPARK-14666.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Using DISTINCT on a UDF (like CONCAT) is not supported
> --
>
> Key: SPARK-14666
> URL: https://issues.apache.org/jira/browse/SPARK-14666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Dominic Ricard
>Priority: Minor
> Fix For: 2.0.0
>
>
> The following query fails with:
> {noformat}
> Java::JavaSql::SQLException: org.apache.spark.sql.AnalysisException: cannot 
> resolve 'column_1' given input columns: [_c0]; line # pos ##
> {noformat}
> Query:
> {noformat}
> select
>   distinct concat(column_1, ' : ', column_2)
> from
>   table
> order by
>   concat(column_1, ' : ', column_2);
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14666) Using DISTINCT on a UDF (like CONCAT) is not supported

2016-08-08 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412243#comment-15412243
 ] 

Dominic Ricard commented on SPARK-14666:


It does indeed work in Spark 2.0. Thanks.

> Using DISTINCT on a UDF (like CONCAT) is not supported
> --
>
> Key: SPARK-14666
> URL: https://issues.apache.org/jira/browse/SPARK-14666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Dominic Ricard
>Priority: Minor
>
> The following query fails with:
> {noformat}
> Java::JavaSql::SQLException: org.apache.spark.sql.AnalysisException: cannot 
> resolve 'column_1' given input columns: [_c0]; line # pos ##
> {noformat}
> Query:
> {noformat}
> select
>   distinct concat(column_1, ' : ', column_2)
> from
>   table
> order by
>   concat(column_1, ' : ', column_2);
> {noformat}
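
A self-contained sketch of the pattern on 2.0.x (placeholder data and table name); on 1.6.x, aliasing the expression and ordering by the alias, as below, is a possible workaround, though that is not verified here:

{code}
import org.apache.spark.sql.SparkSession

object DistinctConcatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("distinct-concat-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder data standing in for the reporter's table.
    Seq(("a", "1"), ("a", "1"), ("b", "2")).toDF("column_1", "column_2")
      .createOrReplaceTempView("t")

    // DISTINCT over the UDF result, ordered by the same expression via an alias.
    spark.sql(
      """SELECT DISTINCT concat(column_1, ' : ', column_2) AS joined
        |FROM t
        |ORDER BY joined""".stripMargin)
      .show()

    spark.stop()
  }
}
{code}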



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

2016-08-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16951:

Fix Version/s: (was: 2.1.0)

> Alternative implementation of NOT IN to Anti-join
> -
>
> Key: SPARK-16951
> URL: https://issues.apache.org/jira/browse/SPARK-16951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> A transformation currently used to process a {{NOT IN}} subquery is to rewrite 
> it to a null-aware Anti-join in the Logical Plan and then translate that to an 
> {{OR}} predicate joining the parent side and the subquery side of the 
> {{NOT IN}}. As a result, the {{OR}} predicate restricts execution to a 
> nested-loop join plan, which has a major performance impact when both sides' 
> results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.
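
To see the current behaviour described above, a small sketch (placeholder data; the exact plan text varies by version):

{code}
import org.apache.spark.sql.SparkSession

object NotInPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("not-in-plan-sketch").getOrCreate()
    import spark.implicits._

    Seq(1, 2, 3).toDF("a").createOrReplaceTempView("t1")
    Seq(2, 3).toDF("b").createOrReplaceTempView("t2")

    // The explain output shows how the NOT IN subquery is planned; per the
    // description above, the null-aware rewrite produces an OR'ed join condition
    // that confines execution to a nested-loop join.
    spark.sql("SELECT * FROM t1 WHERE a NOT IN (SELECT b FROM t2)").explain(true)

    spark.stop()
  }
}
{code}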



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

2016-08-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16951:

Target Version/s:   (was: 2.1.0)

> Alternative implementation of NOT IN to Anti-join
> -
>
> Key: SPARK-16951
> URL: https://issues.apache.org/jira/browse/SPARK-16951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> A transformation currently used to process a {{NOT IN}} subquery is to rewrite 
> it to a null-aware Anti-join in the Logical Plan and then translate that to an 
> {{OR}} predicate joining the parent side and the subquery side of the 
> {{NOT IN}}. As a result, the {{OR}} predicate restricts execution to a 
> nested-loop join plan, which has a major performance impact when both sides' 
> results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16586) spark-class crash with "[: too many arguments" instead of displaying the correct error message

2016-08-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16586.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0
   2.0.1

> spark-class crash with "[: too many arguments" instead of displaying the 
> correct error message
> --
>
> Key: SPARK-16586
> URL: https://issues.apache.org/jira/browse/SPARK-16586
> Project: Spark
>  Issue Type: Bug
>Reporter: Xiang Gao
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> When trying to run spark on a machine that cannot provide enough memory for 
> java to use, instead of printing the correct error message, spark-class will 
> crash with {{spark-class: line 83: [: too many arguments}}
> Simple shell commands to trigger this problem are:
> {code}
> ulimit -v 10
> ./sbin/start-master.sh
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16930:


Assignee: (was: Apache Spark)

> ApplicationMaster's code that waits for SparkContext is race-prone
> --
>
> Key: SPARK-16930
> URL: https://issues.apache.org/jira/browse/SPARK-16930
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While taking a look at SPARK-15937 and checking if there's something wrong 
> with the code, I noticed two races that explain the behavior.
> Because they're really narrow races, I'm a little wary of declaring them the 
> cause of that bug. Also because the logs posted there don't really explain 
> what went wrong (and don't really look like a SparkContext was run at all).
> The races I found are:
> - it's possible, but very unlikely, for an application to instantiate a 
> SparkContext and stop it before the AM enters the loop where it checks for 
> the instance.
> - it's possible, but very unlikely, for an application to stop the 
> SparkContext after the AM is already waiting for one, has been notified of 
> its creation, but hasn't yet stored the SparkContext reference in a local 
> variable.
> I'll fix those and clean up the code a bit in the process.
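
A generic illustration of the second race (not the actual ApplicationMaster code; names are invented): being notified that a value was set does not guarantee the value is still there when it is finally read.

{code}
import java.util.concurrent.atomic.AtomicReference
import java.util.concurrent.{CountDownLatch, TimeUnit}

object WaitForContextSketch {
  // Stands in for the slot the AM reads the SparkContext from.
  private val slot = new AtomicReference[String]()
  private val created = new CountDownLatch(1)

  // The "application": publishes a context and stops it almost immediately.
  def createThenStop(): Unit = {
    slot.set("context")
    created.countDown()
    slot.set(null) // a quick sc.stop() arriving before the waiter reads the slot
  }

  // The "AM": waits for the creation signal, then reads the slot.
  def awaitContext(timeoutMs: Long): Option[String] = {
    created.await(timeoutMs, TimeUnit.MILLISECONDS)
    Option(slot.get()) // may already be None -- the race
  }

  def main(args: Array[String]): Unit = {
    val t = new Thread(new Runnable { def run(): Unit = createThenStop() })
    t.start()
    println(awaitContext(1000L)) // sometimes Some(context), sometimes None
    t.join()
  }
}
{code}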



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone

2016-08-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16930:


Assignee: Apache Spark

> ApplicationMaster's code that waits for SparkContext is race-prone
> --
>
> Key: SPARK-16930
> URL: https://issues.apache.org/jira/browse/SPARK-16930
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> While taking a look at SPARK-15937 and checking if there's something wrong 
> with the code, I noticed two races that explain the behavior.
> Because they're really narrow races, I'm a little wary of declaring them the 
> cause of that bug. Also because the logs posted there don't really explain 
> what went wrong (and don't really look like a SparkContext was run at all).
> The races I found are:
> - it's possible, but very unlikely, for an application to instantiate a 
> SparkContext and stop it before the AM enters the loop where it checks for 
> the instance.
> - it's possible, but very unlikely, for an application to stop the 
> SparkContext after the AM is already waiting for one, has been notified of 
> its creation, but hasn't yet stored the SparkContext reference in a local 
> variable.
> I'll fix those and clean up the code a bit in the process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone

2016-08-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412130#comment-15412130
 ] 

Apache Spark commented on SPARK-16930:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14542

> ApplicationMaster's code that waits for SparkContext is race-prone
> --
>
> Key: SPARK-16930
> URL: https://issues.apache.org/jira/browse/SPARK-16930
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While taking a look at SPARK-15937 and checking if there's something wrong 
> with the code, I noticed two races that explain the behavior.
> Because they're really narrow races, I'm a little wary of declaring them the 
> cause of that bug. Also because the logs posted there don't really explain 
> what went wrong (and don't really look like a SparkContext was run at all).
> The races I found are:
> - it's possible, but very unlikely, for an application to instantiate a 
> SparkContext and stop it before the AM enters the loop where it checks for 
> the instance.
> - it's possible, but very unlikely, for an application to stop the 
> SparkContext after the AM is already waiting for one, has been notified of 
> its creation, but hasn't yet stored the SparkContext reference in a local 
> variable.
> I'll fix those and clean up the code a bit in the process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


