[jira] [Commented] (SPARK-16965) Fix bound checking for SparseVector
[ https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412965#comment-15412965 ] Apache Spark commented on SPARK-16965: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/14555 > Fix bound checking for SparseVector > --- > > Key: SPARK-16965 > URL: https://issues.apache.org/jira/browse/SPARK-16965 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Priority: Minor > > There are several issues in the bound checking of SparseVector: > 1. In Scala, negative index checking is missing, and bound checking is scattered across several places; it should be consolidated in one place. > 2. In Python, lower/upper bound checking of indices is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
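The consolidation the issue asks for amounts to one shared bounds check covering both the negative-index and upper-bound cases. A minimal standalone sketch (plain Python for illustration; `check_index` is a hypothetical name, not the actual MLlib code):

```python
def check_index(index, size):
    """Validate an index into a vector of the given size in one place:
    reject negative indices as well as indices >= size."""
    if index < 0:
        raise IndexError(
            "index %d is out of range: negative indices are not allowed" % index)
    if index >= size:
        raise IndexError(
            "index %d is out of range for a vector of size %d" % (index, size))
    return index
```

A fix along the lines the issue proposes would call one helper like this from every accessor on the Scala side and add the same validation to the Python SparseVector's item lookup.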
[jira] [Assigned] (SPARK-16965) Fix bound checking for SparseVector
[ https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16965: Assignee: (was: Apache Spark) > Fix bound checking for SparseVector > --- > > Key: SPARK-16965 > URL: https://issues.apache.org/jira/browse/SPARK-16965 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Priority: Minor > > There are several issues in the bound checking of SparseVector: > 1. In Scala, negative index checking is missing, and bound checking is scattered across several places; it should be consolidated in one place. > 2. In Python, lower/upper bound checking of indices is missing.
[jira] [Assigned] (SPARK-16965) Fix bound checking for SparseVector
[ https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16965: Assignee: Apache Spark > Fix bound checking for SparseVector > --- > > Key: SPARK-16965 > URL: https://issues.apache.org/jira/browse/SPARK-16965 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > > There are several issues in the bound checking of SparseVector: > 1. In Scala, negative index checking is missing, and bound checking is scattered across several places; it should be consolidated in one place. > 2. In Python, lower/upper bound checking of indices is missing.
[jira] [Updated] (SPARK-16965) Fix bound checking for SparseVector
[ https://issues.apache.org/jira/browse/SPARK-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-16965: --- Component/s: PySpark MLlib > Fix bound checking for SparseVector > --- > > Key: SPARK-16965 > URL: https://issues.apache.org/jira/browse/SPARK-16965 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Priority: Minor > > There are several issues in the bound checking of SparseVector: > 1. In Scala, negative index checking is missing, and bound checking is scattered across several places; it should be consolidated in one place. > 2. In Python, lower/upper bound checking of indices is missing.
[jira] [Resolved] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16887. -- Resolution: Won't Fix > Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH > > > Key: SPARK-16887 > URL: https://issues.apache.org/jira/browse/SPARK-16887 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Yin Huai >Assignee: Yin Huai > > To deploy Spark, it can be pretty convenient to put all jars (spark jars, > hadoop jars, and other libs' jars) that we want to include in the classpath > of Spark in the same dir, which may not be spark's assembly dir. So, I am > proposing to also add SPARK_DIST_CLASSPATH to the LAUNCH_CLASSPATH.
[jira] [Created] (SPARK-16965) Fix bound checking for SparseVector
Jeff Zhang created SPARK-16965: -- Summary: Fix bound checking for SparseVector Key: SPARK-16965 URL: https://issues.apache.org/jira/browse/SPARK-16965 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Jeff Zhang Priority: Minor There are several issues in the bound checking of SparseVector: 1. In Scala, negative index checking is missing, and bound checking is scattered across several places; it should be consolidated in one place. 2. In Python, lower/upper bound checking of indices is missing.
[jira] [Assigned] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package
[ https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16964: Assignee: Apache Spark (was: Reynold Xin) > Remove private[sql] and private[spark] from sql.execution package > - > > Key: SPARK-16964 > URL: https://issues.apache.org/jira/browse/SPARK-16964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > The execution package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime.
[jira] [Commented] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package
[ https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412907#comment-15412907 ] Apache Spark commented on SPARK-16964: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/14554 > Remove private[sql] and private[spark] from sql.execution package > - > > Key: SPARK-16964 > URL: https://issues.apache.org/jira/browse/SPARK-16964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > The execution package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime.
[jira] [Assigned] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package
[ https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16964: Assignee: Reynold Xin (was: Apache Spark) > Remove private[sql] and private[spark] from sql.execution package > - > > Key: SPARK-16964 > URL: https://issues.apache.org/jira/browse/SPARK-16964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > The execution package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime.
[jira] [Updated] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package
[ https://issues.apache.org/jira/browse/SPARK-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16964: Description: The execution package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime. > Remove private[sql] and private[spark] from sql.execution package > - > > Key: SPARK-16964 > URL: https://issues.apache.org/jira/browse/SPARK-16964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > The execution package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime.
[jira] [Created] (SPARK-16964) Remove private[sql] and private[spark] from sql.execution package
Reynold Xin created SPARK-16964: --- Summary: Remove private[sql] and private[spark] from sql.execution package Key: SPARK-16964 URL: https://issues.apache.org/jira/browse/SPARK-16964 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412893#comment-15412893 ] Dongjoon Hyun commented on SPARK-16955: --- Hi, [~yhuai]. Could you review the PR? The root cause was `ResolveAggregateFunctions` removed the ordinal sort orders too early. After improving the `if` condition to check the resolution is completed, the case works well. > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. 
on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at >
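The exception above is raised while validating a GROUP BY ordinal against a select list whose size is reported as 0, i.e. the rule fired before the select list was resolved. The substitution such a rule performs can be modeled with a standalone sketch (plain Python with hypothetical names; not the actual Catalyst rule):

```python
def resolve_ordinals(clause, select_list):
    """Replace 1-based ordinals in a GROUP BY / ORDER BY clause with the
    corresponding select-list expressions, checking bounds first
    (a toy model of the ordinal-resolution analysis rule)."""
    resolved = []
    for item in clause:
        if isinstance(item, int):
            if not 1 <= item <= len(select_list):
                raise ValueError(
                    "Group by position: %d exceeds the size of the select list %d"
                    % (item, len(select_list)))
            resolved.append(select_list[item - 1])
        else:
            resolved.append(item)
    return resolved
```

The fix described in the comment amounts to guarding this substitution so it runs only once resolution is complete, instead of validating ordinals against a still-empty select list.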
[jira] [Comment Edited] (SPARK-16951) Alternative implementation of NOT IN to Anti-join
[ https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412882#comment-15412882 ] Nattavut Sutyanyong edited comment on SPARK-16951 at 8/9/16 3:54 AM: - The following output is tested on Spark master trunk built on August 5, 2016. {noformat} scala> Seq(1,2).toDF("c1").createOrReplaceTempView("t1") scala> Seq(1).toDF("c2").createOrReplaceTempView("t2") scala> sql("select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2").createOrReplaceTempView("t3") scala> sql("select * from t1").show +---+ | c1| +---+ | 1| | 2| +---+ scala> sql("select * from t2").show +---+ | c2| +---+ | 1| +---+ scala> sql("select * from t3").show ++ | c3| ++ | 2| |null| ++ {noformat} Case 1: {noformat} scala> sql("select * from t3 where c3 not in (select c2 from t2)").show ++ | c3| ++ | 2| |null| ++ {noformat} The correct result is: {noformat} ++ | c3| ++ | 2| ++ {noformat} Case 2: {noformat} scala> sql("select * from t1 where c1 not in (select c3 from t3)").show +---+ | c1| +---+ +---+ {noformat} The answer is correct. Case 3: {noformat} scala> sql("select * from t1 where c1 not in (select c2 from t2 where 1=2)").show +---+ | c1| +---+ | 1| | 2| +---+ {noformat} The correct result is: {noformat} +---+ | c1| +---+ +---+ {noformat} was (Author: nsyca): The following output is tested on Spark master trunk built on August 5, 2016. 
scala> Seq(1,2).toDF("c1").createOrReplaceTempView("t1") scala> Seq(1).toDF("c2").createOrReplaceTempView("t2") scala> sql("select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2").createOrReplaceTempView("t3") scala> sql("select * from t1").show +---+ | c1| +---+ | 1| | 2| +---+ scala> sql("select * from t2").show +---+ | c2| +---+ | 1| +---+ scala> sql("select * from t3").show ++ | c3| ++ | 2| |null| ++ Case 1: scala> sql("select * from t3 where c3 not in (select c2 from t2)").show ++ | c3| ++ | 2| |null| ++ The correct result is: ++ | c3| ++ | 2| ++ Case 2: scala> sql("select * from t1 where c1 not in (select c3 from t3)").show +---+ | c1| +---+ +---+ The answer is correct. Case 3: scala> sql("select * from t1 where c1 not in (select c2 from t2 where 1=2)").show +---+ | c1| +---+ | 1| | 2| +---+ The correct result is: +---+ | c1| +---+ +---+ > Alternative implementation of NOT IN to Anti-join > - > > Key: SPARK-16951 > URL: https://issues.apache.org/jira/browse/SPARK-16951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > A transformation currently used to process {{NOT IN}} subquery is to rewrite > to a form of Anti-join with null-aware property in the Logical Plan and then > translate to a form of {{OR}} predicate joining the parent side and the > subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} > predicate is limited to the nested-loop join execution plan, which will have > a major performance implication if both sides' results are large. > This JIRA sketches an idea of changing the OR predicate to a form similar to > the technique used in the implementation of the Existence join that addresses > the problem of {{EXISTS (..) OR ..}} type of queries.
[jira] [Commented] (SPARK-16951) Alternative implementation of NOT IN to Anti-join
[ https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412882#comment-15412882 ] Nattavut Sutyanyong commented on SPARK-16951: - The following output is tested on Spark master trunk built on August 5, 2016. scala> Seq(1,2).toDF("c1").createOrReplaceTempView("t1") scala> Seq(1).toDF("c2").createOrReplaceTempView("t2") scala> sql("select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2").createOrReplaceTempView("t3") scala> sql("select * from t1").show +---+ | c1| +---+ | 1| | 2| +---+ scala> sql("select * from t2").show +---+ | c2| +---+ | 1| +---+ scala> sql("select * from t3").show ++ | c3| ++ | 2| |null| ++ Case 1: scala> sql("select * from t3 where c3 not in (select c2 from t2)").show ++ | c3| ++ | 2| |null| ++ The correct result is: ++ | c3| ++ | 2| ++ Case 2: scala> sql("select * from t1 where c1 not in (select c3 from t3)").show +---+ | c1| +---+ +---+ The answer is correct. Case 3: scala> sql("select * from t1 where c1 not in (select c2 from t2 where 1=2)").show +---+ | c1| +---+ | 1| | 2| +---+ The correct result is: +---+ | c1| +---+ +---+ > Alternative implementation of NOT IN to Anti-join > - > > Key: SPARK-16951 > URL: https://issues.apache.org/jira/browse/SPARK-16951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > A transformation currently used to process {{NOT IN}} subquery is to rewrite > to a form of Anti-join with null-aware property in the Logical Plan and then > translate to a form of {{OR}} predicate joining the parent side and the > subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} > predicate is limited to the nested-loop join execution plan, which will have > a major performance implication if both sides' results are large. 
> This JIRA sketches an idea of changing the OR predicate to a form similar to > the technique used in the implementation of the Existence join that addresses > the problem of {{EXISTS (..) OR ..}} type of queries.
[jira] [Commented] (SPARK-16951) Alternative implementation of NOT IN to Anti-join
[ https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412878#comment-15412878 ] Nattavut Sutyanyong commented on SPARK-16951: - The semantics of {{NOT IN}} are described in detail in "[Subqueries in Apache Spark 2.0|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html]". Concisely, "{{x NOT IN (subquery y)}} translates into: {{x <> y1 AND x <> y2 ... AND x <> yn}}" When {{x}} and {{subquery y}} cannot produce a {{NULL}} value, {{NOT IN}} is equivalent to its {{NOT EXISTS}} counterpart. That is, {{SELECT .. FROM X WHERE X.C1 NOT IN (SELECT Y.C2 FROM Y)}} is equivalent to {{SELECT .. FROM X WHERE NOT EXISTS (SELECT 1 FROM Y WHERE X.C1=Y.C2)}}. However, there are 3 edge cases we need to pay attention to. Case 1. When {{X.C1}} is {{NULL}}, the row is removed from the result set. Case 2. When the {{subquery Y}} can produce a {{NULL}} value in the output column {{Y.C2}}, the result is an empty set. Case 3. When the {{subquery Y}} produces an empty set, the SQL language defines that the subquery returns a row of {{NULL}} value; hence this is like Case 2, which returns an empty set. > Alternative implementation of NOT IN to Anti-join > - > > Key: SPARK-16951 > URL: https://issues.apache.org/jira/browse/SPARK-16951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > A transformation currently used to process {{NOT IN}} subquery is to rewrite > to a form of Anti-join with null-aware property in the Logical Plan and then > translate to a form of {{OR}} predicate joining the parent side and the > subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} > predicate is limited to the nested-loop join execution plan, which will have > a major performance implication if both sides' results are large. 
> This JIRA sketches an idea of changing the OR predicate to a form similar to > the technique used in the implementation of the Existence join that addresses > the problem of {{EXISTS (..) OR ..}} type of queries.
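The three edge cases can be made concrete with a small model of SQL's three-valued logic, following the semantics exactly as the comment above describes them ({{NULL}} modeled as None; an illustrative sketch, not Spark code):

```python
def not_in(x, subquery_values):
    """Evaluate `x NOT IN (subquery)` per the three cases described above.
    Returns True, False, or None (unknown); a WHERE clause keeps only True rows."""
    if x is None:
        return None          # Case 1: NULL on the left side -> unknown
    if not subquery_values:
        return None          # Case 3: empty subquery treated as a NULL row
    if x in subquery_values:
        return False         # a definite match: x <> yi is false for some yi
    if None in subquery_values:
        return None          # Case 2: x may equal the NULL value -> unknown
    return True
```

Since a WHERE clause keeps only rows for which the predicate evaluates to True, an unknown (None) result filters the row out; that is why all three cases shrink the result set rather than passing rows through.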
[jira] [Updated] (SPARK-12920) Honor "spark.ui.retainedStages" to reduce mem-pressure
[ https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated SPARK-12920: - Summary: Honor "spark.ui.retainedStages" to reduce mem-pressure (was: Fix high CPU usage in spark thrift server with concurrent users) > Honor "spark.ui.retainedStages" to reduce mem-pressure > -- > > Key: SPARK-12920 > URL: https://issues.apache.org/jira/browse/SPARK-12920 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Rajesh Balamohan > Attachments: SPARK-12920.profiler.png, > SPARK-12920.profiler_job_progress_listner.png > > > - Configured with fair-share-scheduler. > - 4-5 users submitting/running jobs concurrently via spark-thrift-server > - Spark thrift server spikes to 1600+% CPU and stays there for a long time
[jira] [Commented] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412849#comment-15412849 ] Nattavut Sutyanyong commented on SPARK-16804: - Thank you, @hvanhovell, for merging my PR. > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Fix For: 2.1.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to join > predicates and neglects the semantics of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1.
[jira] [Assigned] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state
[ https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16963: Assignee: Apache Spark > Change Source API so that sources do not need to keep unbounded state > - > > Key: SPARK-16963 > URL: https://issues.apache.org/jira/browse/SPARK-16963 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Frederick Reiss >Assignee: Apache Spark > > The version of the Source API in Spark 2.0.0 defines a single getBatch() > method for fetching records from the source, with the following Scaladoc > comments defining the semantics: > {noformat} > /** > * Returns the data that is between the offsets (`start`, `end`]. When > `start` is `None` then > * the batch should begin with the first available record. This method must > always return the > * same data for a particular `start` and `end` pair. > */ > def getBatch(start: Option[Offset], end: Offset): DataFrame > {noformat} > These semantics mean that a Source must retain all past history for the > stream that it backs. Further, a Source is also required to retain this data > across restarts of the process where the Source is instantiated, even when > the Source is restarted on a different machine. > These restrictions make it difficult to implement the Source API, as any > implementation requires potentially unbounded amounts of distributed storage. > See the mailing list thread at > [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] > for more information. > This JIRA will cover augmenting the Source API with an additional callback > that will allow Structured Streaming scheduler to notify the source when it > is safe to discard buffered data.
[jira] [Assigned] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state
[ https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16963: Assignee: (was: Apache Spark) > Change Source API so that sources do not need to keep unbounded state > - > > Key: SPARK-16963 > URL: https://issues.apache.org/jira/browse/SPARK-16963 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Frederick Reiss > > The version of the Source API in Spark 2.0.0 defines a single getBatch() > method for fetching records from the source, with the following Scaladoc > comments defining the semantics: > {noformat} > /** > * Returns the data that is between the offsets (`start`, `end`]. When > `start` is `None` then > * the batch should begin with the first available record. This method must > always return the > * same data for a particular `start` and `end` pair. > */ > def getBatch(start: Option[Offset], end: Offset): DataFrame > {noformat} > These semantics mean that a Source must retain all past history for the > stream that it backs. Further, a Source is also required to retain this data > across restarts of the process where the Source is instantiated, even when > the Source is restarted on a different machine. > These restrictions make it difficult to implement the Source API, as any > implementation requires potentially unbounded amounts of distributed storage. > See the mailing list thread at > [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] > for more information. > This JIRA will cover augmenting the Source API with an additional callback > that will allow Structured Streaming scheduler to notify the source when it > is safe to discard buffered data.
[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state
[ https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412840#comment-15412840 ] Apache Spark commented on SPARK-16963: -- User 'frreiss' has created a pull request for this issue: https://github.com/apache/spark/pull/14553 > Change Source API so that sources do not need to keep unbounded state > - > > Key: SPARK-16963 > URL: https://issues.apache.org/jira/browse/SPARK-16963 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Frederick Reiss > > The version of the Source API in Spark 2.0.0 defines a single getBatch() > method for fetching records from the source, with the following Scaladoc > comments defining the semantics: > {noformat} > /** > * Returns the data that is between the offsets (`start`, `end`]. When > `start` is `None` then > * the batch should begin with the first available record. This method must > always return the > * same data for a particular `start` and `end` pair. > */ > def getBatch(start: Option[Offset], end: Offset): DataFrame > {noformat} > These semantics mean that a Source must retain all past history for the > stream that it backs. Further, a Source is also required to retain this data > across restarts of the process where the Source is instantiated, even when > the Source is restarted on a different machine. > These restrictions make it difficult to implement the Source API, as any > implementation requires potentially unbounded amounts of distributed storage. > See the mailing list thread at > [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] > for more information. > This JIRA will cover augmenting the Source API with an additional callback > that will allow Structured Streaming scheduler to notify the source when it > is safe to discard buffered data. 
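The unbounded-state problem and the proposed callback can be illustrated with a toy source (plain Python; `commit` is a hypothetical name for the kind of callback this JIRA proposes, not an actual Spark API):

```python
class BufferingSource:
    """Toy model of a streaming Source that must retain data until the
    scheduler says it is safe to discard it."""

    def __init__(self):
        self.buffer = []          # (offset, record) pairs; grows without bound today

    def add(self, offset, record):
        self.buffer.append((offset, record))

    def get_batch(self, start, end):
        # Records with start < offset <= end; start=None means from the beginning.
        lo = start if start is not None else float("-inf")
        return [r for (o, r) in self.buffer if lo < o <= end]

    def commit(self, end):
        # Hypothetical callback: everything at or below `end` may be discarded.
        self.buffer = [(o, r) for (o, r) in self.buffer if o > end]
```

Without something like `commit`, the buffer can only grow, which is the unbounded-storage requirement the JIRA describes.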
[jira] [Updated] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16610: Assignee: Hyukjin Kwon > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Hyukjin Kwon > Fix For: 2.0.1, 2.1.0 > > > For ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec and its value will be also set to orc.compress (the orc conf > used for codec). However, if a user only set {{orc.compress}} in the writer > option, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16610. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14518 [https://github.com/apache/spark/pull/14518] > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai > Fix For: 2.0.1, 2.1.0 > > > For ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec and its value will be also set to orc.compress (the orc conf > used for codec). However, if a user only set {{orc.compress}} in the writer > option, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
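The codec precedence this fix establishes (explicit writer option "compression" first, then the ORC conf "orc.compress", and only then the default) can be sketched as a small resolution function. This is an illustrative sketch, not Spark's actual ORC writer code; the function name is made up.

```python
# Illustrative sketch (not Spark's actual code) of the codec resolution
# order the issue asks for: the writer option "compression" takes priority,
# then the ORC conf "orc.compress", and only then the default ("snappy").
def resolve_orc_codec(options):
    if "compression" in options:
        return options["compression"].lower()
    if "orc.compress" in options:
        return options["orc.compress"].lower()
    return "snappy"  # Spark's default ORC codec

# Before the fix, a user setting only orc.compress was silently overridden
# by the default; with this order, the user's choice is respected:
print(resolve_orc_codec({"orc.compress": "ZLIB"}))  # zlib
```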
[jira] [Created] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state
Frederick Reiss created SPARK-16963: --- Summary: Change Source API so that sources do not need to keep unbounded state Key: SPARK-16963 URL: https://issues.apache.org/jira/browse/SPARK-16963 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 2.0.0 Reporter: Frederick Reiss The version of the Source API in Spark 2.0.0 defines a single getBatch() method for fetching records from the source, with the following Scaladoc comments defining the semantics: {noformat} /** * Returns the data that is between the offsets (`start`, `end`]. When `start` is `None` then * the batch should begin with the first available record. This method must always return the * same data for a particular `start` and `end` pair. */ def getBatch(start: Option[Offset], end: Offset): DataFrame {noformat} These semantics mean that a Source must retain all past history for the stream that it backs. Further, a Source is also required to retain this data across restarts of the process where the Source is instantiated, even when the Source is restarted on a different machine. These restrictions make it difficult to implement the Source API, as any implementation requires potentially unbounded amounts of distributed storage. See the mailing list thread at [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] for more information. This JIRA will cover augmenting the Source API with an additional callback that will allow Structured Streaming scheduler to notify the source when it is safe to discard buffered data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
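The proposed callback can be illustrated with a toy in-memory source: the scheduler notifies the source once a batch is durably processed, and the source drops the buffered data instead of retaining unbounded history. All names here (`BufferedSource`, `commit`, integer offsets) are assumptions for illustration, not the actual Source API.

```python
# Hypothetical sketch of the proposed idea: the scheduler calls commit(end)
# once the batch up to `end` is durably processed, letting the source
# discard buffered records instead of keeping unbounded history.
class BufferedSource:
    def __init__(self):
        self.buffer = {}  # offset -> record

    def add(self, offset, record):
        self.buffer[offset] = record

    def get_batch(self, start, end):
        # Records in (start, end]; start=None means from the first record.
        lo = start if start is not None else -1
        return [r for o, r in sorted(self.buffer.items()) if lo < o <= end]

    def commit(self, end):
        # Scheduler says it is safe to discard everything at or before `end`.
        self.buffer = {o: r for o, r in self.buffer.items() if o > end}

src = BufferedSource()
for i in range(5):
    src.add(i, "rec%d" % i)
src.commit(2)               # batch (..., 2] is done; drop it
print(src.get_batch(2, 4))  # ['rec3', 'rec4'] still available
```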
[jira] [Assigned] (SPARK-16952) [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home even if spark.executor.uri is set
[ https://issues.apache.org/jira/browse/SPARK-16952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16952: Assignee: (was: Apache Spark) > [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home > even if spark.executor.uri is set > --- > > Key: SPARK-16952 > URL: https://issues.apache.org/jira/browse/SPARK-16952 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 1.5.2, 1.6.0, 1.6.1, 2.0.0 >Reporter: Charles Allen >Priority: Minor > > In the Mesos coarse grained scheduler, setting `spark.executor.uri` bypasses > the code path which requires `spark.mesos.executor.home` since the uri > effectively provides the executor home. > But > `org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend#createCommand` > requires `spark.mesos.executor.home` to be set regardless. > Our workaround is to set `spark.mesos.executor.home=/dev/null` when using an > executor uri. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
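The relaxed check this issue asks for, in which spark.mesos.executor.home is only required when no executor URI supplies the distribution, might look like the following sketch. The function name and command strings are hypothetical, not Spark's Mesos backend code.

```python
# Illustrative sketch of the relaxed requirement: only demand
# spark.mesos.executor.home when spark.executor.uri is unset, since the
# fetched URI already provides the executor home.
def executor_command_prefix(conf):
    if conf.get("spark.executor.uri"):
        # The sandbox-fetched archive supplies Spark; no local home needed.
        return "cd spark-*; ./bin/spark-class"
    home = conf.get("spark.mesos.executor.home")
    if home is None:
        raise ValueError("spark.mesos.executor.home must be set")
    return "%s/bin/spark-class" % home

# No /dev/null workaround needed when a URI is set:
print(executor_command_prefix({"spark.executor.uri": "http://repo/spark.tgz"}))
```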
[jira] [Assigned] (SPARK-16952) [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home even if spark.executor.uri is set
[ https://issues.apache.org/jira/browse/SPARK-16952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16952: Assignee: Apache Spark > [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home > even if spark.executor.uri is set > --- > > Key: SPARK-16952 > URL: https://issues.apache.org/jira/browse/SPARK-16952 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 1.5.2, 1.6.0, 1.6.1, 2.0.0 >Reporter: Charles Allen >Assignee: Apache Spark >Priority: Minor > > In the Mesos coarse grained scheduler, setting `spark.executor.uri` bypasses > the code path which requires `spark.mesos.executor.home` since the uri > effectively provides the executor home. > But > `org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend#createCommand` > requires `spark.mesos.executor.home` to be set regardless. > Our workaround is to set `spark.mesos.executor.home=/dev/null` when using an > executor uri. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16952) [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home even if spark.executor.uri is set
[ https://issues.apache.org/jira/browse/SPARK-16952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412825#comment-15412825 ] Apache Spark commented on SPARK-16952: -- User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/14552 > [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home > even if spark.executor.uri is set > --- > > Key: SPARK-16952 > URL: https://issues.apache.org/jira/browse/SPARK-16952 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 1.5.2, 1.6.0, 1.6.1, 2.0.0 >Reporter: Charles Allen >Priority: Minor > > In the Mesos coarse grained scheduler, setting `spark.executor.uri` bypasses > the code path which requires `spark.mesos.executor.home` since the uri > effectively provides the executor home. > But > `org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend#createCommand` > requires `spark.mesos.executor.home` to be set regardless. > Our workaround is to set `spark.mesos.executor.home=/dev/null` when using an > executor uri. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16962) Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in SPARC/Solaris
Suman Somasundar created SPARK-16962: Summary: Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in SPARC/Solaris Key: SPARK-16962 URL: https://issues.apache.org/jira/browse/SPARK-16962 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Environment: SPARC/Solaris Reporter: Suman Somasundar Unaligned accesses are not supported on SPARC architecture. Because of this, Spark applications fail by dumping core on SPARC machines whenever unaligned accesses happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly
[ https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas updated SPARK-16961: - Comment: was deleted (was: I am submitting a PR) > Utils.randomizeInPlace does not shuffle arrays uniformly > > > Key: SPARK-16961 > URL: https://issues.apache.org/jira/browse/SPARK-16961 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nicholas >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > The Utils.randomizeInPlace method, which is meant to uniformly shuffle the > elements on an input array, never shuffles elements to their starting > position. That is, every permutation of the input array is equally likely to > be returned, except for any permutation in which any element is in the same > position where it began. These permutations are never output. > This is because line 827 of Utils.scala should be > {{val j = rand.nextInt(i + 1)}} > instead of > {{val j = rand.nextInt( i )}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly
[ https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16961: Assignee: (was: Apache Spark) > Utils.randomizeInPlace does not shuffle arrays uniformly > > > Key: SPARK-16961 > URL: https://issues.apache.org/jira/browse/SPARK-16961 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nicholas >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > The Utils.randomizeInPlace method, which is meant to uniformly shuffle the > elements on an input array, never shuffles elements to their starting > position. That is, every permutation of the input array is equally likely to > be returned, except for any permutation in which any element is in the same > position where it began. These permutations are never output. > This is because line 827 of Utils.scala should be > {{val j = rand.nextInt(i + 1)}} > instead of > {{val j = rand.nextInt( i )}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly
[ https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412783#comment-15412783 ] Apache Spark commented on SPARK-16961: -- User 'nicklavers' has created a pull request for this issue: https://github.com/apache/spark/pull/14551 > Utils.randomizeInPlace does not shuffle arrays uniformly > > > Key: SPARK-16961 > URL: https://issues.apache.org/jira/browse/SPARK-16961 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nicholas >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > The Utils.randomizeInPlace method, which is meant to uniformly shuffle the > elements on an input array, never shuffles elements to their starting > position. That is, every permutation of the input array is equally likely to > be returned, except for any permutation in which any element is in the same > position where it began. These permutations are never output. > This is because line 827 of Utils.scala should be > {{val j = rand.nextInt(i + 1)}} > instead of > {{val j = rand.nextInt( i )}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly
[ https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16961: Assignee: Apache Spark > Utils.randomizeInPlace does not shuffle arrays uniformly > > > Key: SPARK-16961 > URL: https://issues.apache.org/jira/browse/SPARK-16961 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nicholas >Assignee: Apache Spark >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > The Utils.randomizeInPlace method, which is meant to uniformly shuffle the > elements on an input array, never shuffles elements to their starting > position. That is, every permutation of the input array is equally likely to > be returned, except for any permutation in which any element is in the same > position where it began. These permutations are never output. > This is because line 827 of Utils.scala should be > {{val j = rand.nextInt(i + 1)}} > instead of > {{val j = rand.nextInt( i )}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16563) Repeat calling Spark SQL thrift server fetchResults return empty for ExecuteStatement operation
[ https://issues.apache.org/jira/browse/SPARK-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16563. - Resolution: Fixed Assignee: Gu Huiqin Alice Fix Version/s: 2.1.0 2.0.1 > Repeat calling Spark SQL thrift server fetchResults return empty for > ExecuteStatement operation > --- > > Key: SPARK-16563 > URL: https://issues.apache.org/jira/browse/SPARK-16563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: Gu Huiqin Alice >Assignee: Gu Huiqin Alice >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Repeated calls to FetchResults(... orientation=FetchOrientation.FETCH_FIRST ..) > of the Spark SQL thrift service return an empty set after calling > ExecuteStatement of TCLIService. > The bug exists in *function public RowSet getNextRowSet(FetchOrientation > orientation, long maxRows)* > https://github.com/apache/spark/blob/02c8072eea72425e89256347e1f373a3e76e6eba/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java#L332 > The iterator for getting results can be used only once, so repeated calls to > FetchResults with the FETCH_FIRST parameter will return an empty result. > FetchOrientation.FETCH_FIRST -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
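One way to make repeated FETCH_FIRST calls work over a single-use iterator is to cache rows as they are first consumed, so a later rewind replays from the cache. This is a minimal sketch with illustrative names, not the actual Hive/Spark thrift server classes.

```python
# Sketch of the underlying problem and one fix: a one-shot iterator cannot
# serve FETCH_FIRST twice, but caching rows as they are consumed can.
class ResultSet:
    def __init__(self, row_iter):
        self._iter = iter(row_iter)
        self._cache = []
        self._pos = 0

    def fetch(self, orientation, max_rows):
        if orientation == "FETCH_FIRST":
            self._pos = 0  # rewind instead of re-reading a spent iterator
        out = []
        while len(out) < max_rows:
            if self._pos < len(self._cache):
                row = self._cache[self._pos]
            else:
                try:
                    row = next(self._iter)
                except StopIteration:
                    break
                self._cache.append(row)
            self._pos += 1
            out.append(row)
        return out

rs = ResultSet([1, 2, 3])
print(rs.fetch("FETCH_NEXT", 3))   # [1, 2, 3]
print(rs.fetch("FETCH_FIRST", 3))  # [1, 2, 3] again, not []
```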
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412727#comment-15412727 ] Brian commented on SPARK-6235: -- How is it possible that Spark 2.0 comes out and this bug isn't solved? A quick Google search for "Spark 2GB limit" or "Spark Integer.MAX_VALUE" shows that this is a very real problem that affects lots of users. From the outside looking in, it seems like the Spark developers don't have an interest in solving this bug, since it's been around for years at this point (including the jiras this consolidated ticket replaced). Can you provide some sort of an update? Maybe if you don't plan on fixing this issue, you can close the ticket or mark it as won't fix. At least that way we'd have some insight into your plans. Thanks! > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin > > An umbrella ticket to track the various 2G limit we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16898) Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn
[ https://issues.apache.org/jira/browse/SPARK-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16898. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14494 [https://github.com/apache/spark/pull/14494] > Adds argument type information for typed logical plan like MapElements, > TypedFilter, and AppendColumn > - > > Key: SPARK-16898 > URL: https://issues.apache.org/jira/browse/SPARK-16898 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Priority: Minor > Fix For: 2.1.0 > > > Typed logical plan like MapElements, TypedFilter, and AppendColumn contains a > closure field: {{func: (T) => Boolean}}. For example class TypedFilter's > signature is: > {code} > case class TypedFilter( > func: AnyRef, > deserializer: Expression, > child: LogicalPlan) extends UnaryNode > {code} > From the above class signature, we cannot easily find: > 1. What is the input argument's type of the closure {{func}}? How do we know > which apply method to pick if there are multiple overloaded apply methods? > 2. What is the input argument's schema? > With this info, it is easier for us to define some custom optimizer rule to > translate these typed logical plan to more efficient implementation, like the > closure optimization idea in SPARK-14083. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16898) Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn
[ https://issues.apache.org/jira/browse/SPARK-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16898: Assignee: Sean Zhong > Adds argument type information for typed logical plan like MapElements, > TypedFilter, and AppendColumn > - > > Key: SPARK-16898 > URL: https://issues.apache.org/jira/browse/SPARK-16898 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Assignee: Sean Zhong >Priority: Minor > Fix For: 2.1.0 > > > Typed logical plan like MapElements, TypedFilter, and AppendColumn contains a > closure field: {{func: (T) => Boolean}}. For example class TypedFilter's > signature is: > {code} > case class TypedFilter( > func: AnyRef, > deserializer: Expression, > child: LogicalPlan) extends UnaryNode > {code} > From the above class signature, we cannot easily find: > 1. What is the input argument's type of the closure {{func}}? How do we know > which apply method to pick if there are multiple overloaded apply methods? > 2. What is the input argument's schema? > With this info, it is easier for us to define some custom optimizer rule to > translate these typed logical plan to more efficient implementation, like the > closure optimization idea in SPARK-14083. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly
[ https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412705#comment-15412705 ] Nicholas commented on SPARK-16961: -- I am submitting a PR > Utils.randomizeInPlace does not shuffle arrays uniformly > > > Key: SPARK-16961 > URL: https://issues.apache.org/jira/browse/SPARK-16961 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nicholas >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > The Utils.randomizeInPlace method, which is meant to uniformly shuffle the > elements on an input array, never shuffles elements to their starting > position. That is, every permutation of the input array is equally likely to > be returned, except for any permutation in which any element is in the same > position where it began. These permutations are never output. > This is because line 827 of Utils.scala should be > {{val j = rand.nextInt(i + 1)}} > instead of > {{val j = rand.nextInt( i )}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly
[ https://issues.apache.org/jira/browse/SPARK-16961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas updated SPARK-16961: - Description: The Utils.randomizeInPlace method, which is meant to uniformly shuffle the elements on an input array, never shuffles elements to their starting position. That is, every permutation of the input array is equally likely to be returned, except for any permutation in which any element is in the same position where it began. These permutations are never output. This is because line 827 of Utils.scala should be {{val j = rand.nextInt(i + 1)}} instead of {{val j = rand.nextInt( i )}} was: The Utils.randomizeInPlace method, which is meant to uniformly shuffle the elements on an input array, never shuffles elements to their starting position. That is, every permutation of the input array is equally likely to be returned, except for any permutation in which any element is in the same position where it began. These permutations are never output. This is because line 827 of Utils.scala should be {{val j = rand.nextInt(i + 1)}} instead of {{val j = rand.nextInt(i)}} > Utils.randomizeInPlace does not shuffle arrays uniformly > > > Key: SPARK-16961 > URL: https://issues.apache.org/jira/browse/SPARK-16961 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nicholas >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > The Utils.randomizeInPlace method, which is meant to uniformly shuffle the > elements on an input array, never shuffles elements to their starting > position. That is, every permutation of the input array is equally likely to > be returned, except for any permutation in which any element is in the same > position where it began. These permutations are never output. 
> This is because line 827 of Utils.scala should be > {{val j = rand.nextInt(i + 1)}} > instead of > {{val j = rand.nextInt( i )}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16961) Utils.randomizeInPlace does not shuffle arrays uniformly
Nicholas created SPARK-16961: Summary: Utils.randomizeInPlace does not shuffle arrays uniformly Key: SPARK-16961 URL: https://issues.apache.org/jira/browse/SPARK-16961 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Nicholas Priority: Minor The Utils.randomizeInPlace method, which is meant to uniformly shuffle the elements on an input array, never shuffles elements to their starting position. That is, every permutation of the input array is equally likely to be returned, except for any permutation in which any element is in the same position where it began. These permutations are never output. This is because line 827 of Utils.scala should be {{val j = rand.nextInt(i + 1)}} instead of {{val j = rand.nextInt(i)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
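The off-by-one is easy to demonstrate: with `rand.nextInt(i)` (exclusive of i) the loop is Sattolo's algorithm, which only produces cyclic permutations, so no element ever stays in its starting position; with `rand.nextInt(i + 1)` it is the correct Fisher-Yates shuffle. A small Python sketch mirroring the Scala loop:

```python
import random

# Both variants of the loop from Utils.randomizeInPlace: with
# rng.randrange(i) the shuffle is Sattolo's algorithm (only cyclic
# permutations, never a fixed point); with rng.randrange(i + 1) it is the
# uniform Fisher-Yates shuffle.
def shuffle(arr, rng, fixed):
    a = list(arr)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1) if fixed else rng.randrange(i)
        a[i], a[j] = a[j], a[i]
    return a

rng = random.Random(0)
buggy = [shuffle([0, 1, 2, 3], rng, fixed=False) for _ in range(2000)]
# Buggy version: no element is ever left where it started.
print(any(p[k] == k for p in buggy for k in range(4)))  # False

good = [shuffle([0, 1, 2, 3], rng, fixed=True) for _ in range(2000)]
# Correct version: permutations with fixed points do occur.
print(any(p[k] == k for p in good for k in range(4)))
```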
[jira] [Assigned] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16960: Assignee: (was: Apache Spark) > Deprecate approxCountDistinct, toDegrees and toRadians according to > FunctionRegistry in Scala and Python > > > Key: SPARK-16960 > URL: https://issues.apache.org/jira/browse/SPARK-16960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also > missed while matching the names to the ones in {{FunctionRegistry}}. (please > see > [approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244], > > [degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203] > and > [radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222] > in `FunctionRegistry`). > I took a scan between {{functions.scala}} and {{FunctionRegistry}} and it > seems these are all left. For {{countDistinct}} and {{sumDistinct}}, they are > not registered in {{FunctionRegistry}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412695#comment-15412695 ] Apache Spark commented on SPARK-16960: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14538 > Deprecate approxCountDistinct, toDegrees and toRadians according to > FunctionRegistry in Scala and Python > > > Key: SPARK-16960 > URL: https://issues.apache.org/jira/browse/SPARK-16960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also > missed while matching the names to the ones in {{FunctionRegistry}}. (please > see > [approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244], > > [degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203] > and > [radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222] > in `FunctionRegistry`). > I took a scan between {{functions.scala}} and {{FunctionRegistry}} and it > seems these are all left. For {{countDistinct}} and {{sumDistinct}}, they are > not registered in {{FunctionRegistry}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python
Hyukjin Kwon created SPARK-16960: Summary: Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python Key: SPARK-16960 URL: https://issues.apache.org/jira/browse/SPARK-16960 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon Priority: Minor It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also missed while matching the names to the ones in {{FunctionRegistry}}. (please see [approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244], [degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203] and [radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222] in `FunctionRegistry`). I took a scan between {{functions.scala}} and {{FunctionRegistry}} and it seems these are all left. For {{countDistinct}} and {{sumDistinct}}, they are not registered in {{FunctionRegistry}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16960) Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry in Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16960: Assignee: Apache Spark > Deprecate approxCountDistinct, toDegrees and toRadians according to > FunctionRegistry in Scala and Python > > > Key: SPARK-16960 > URL: https://issues.apache.org/jira/browse/SPARK-16960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > It seems {{approxCountDistinct}}, {{toDegrees}} and {{toRadians}} are also > missed while matching the names to the ones in {{FunctionRegistry}}. (please > see > [approx_count_distinct|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L244], > > [degrees|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L203] > and > [radians|https://github.com/apache/spark/blob/5c2ae79bfcf448d8dc9217efafa1409997c739de/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L222] > in `FunctionRegistry`). > I took a scan between {{functions.scala}} and {{FunctionRegistry}} and it > seems these are all left. For {{countDistinct}} and {{sumDistinct}}, they are > not registered in {{FunctionRegistry}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty
[ https://issues.apache.org/jira/browse/SPARK-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412658#comment-15412658 ] Apache Spark commented on SPARK-16959: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14550 > Table Comment in the CatalogTable returned from HiveMetastore is Always Empty > - > > Key: SPARK-16959 > URL: https://issues.apache.org/jira/browse/SPARK-16959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > The `comment` in `CatalogTable` returned from Hive is always empty. We store > it in the table property when creating a table. However, when we try to > retrieve the table metadata from Hive metastore, we do not rebuild it. The > `comment` is always empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty
[ https://issues.apache.org/jira/browse/SPARK-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16959: Assignee: (was: Apache Spark) > Table Comment in the CatalogTable returned from HiveMetastore is Always Empty > - > > Key: SPARK-16959 > URL: https://issues.apache.org/jira/browse/SPARK-16959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > The `comment` in `CatalogTable` returned from Hive is always empty. We store > it in the table property when creating a table. However, when we try to > retrieve the table metadata from Hive metastore, we do not rebuild it. The > `comment` is always empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty
[ https://issues.apache.org/jira/browse/SPARK-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16959: Assignee: Apache Spark > Table Comment in the CatalogTable returned from HiveMetastore is Always Empty > - > > Key: SPARK-16959 > URL: https://issues.apache.org/jira/browse/SPARK-16959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > The `comment` in `CatalogTable` returned from Hive is always empty. We store > it in the table property when creating a table. However, when we try to > retrieve the table metadata from Hive metastore, we do not rebuild it. The > `comment` is always empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16959) Table Comment in the CatalogTable returned from HiveMetastore is Always Empty
Xiao Li created SPARK-16959: --- Summary: Table Comment in the CatalogTable returned from HiveMetastore is Always Empty Key: SPARK-16959 URL: https://issues.apache.org/jira/browse/SPARK-16959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li The `comment` in the `CatalogTable` returned from Hive is always empty. We store it in the table properties when creating a table, but we do not rebuild it when retrieving the table metadata from the Hive metastore, so the `comment` always comes back empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
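The round-trip bug described here (comment stored on create, never restored on read) can be illustrated with a toy sketch. A plain dict stands in for the real Hive metastore API; all names below are illustrative.

```python
# Toy sketch of the SPARK-16959 bug: the comment survives in the stored
# table properties, but the read path never copies it back.
def create_table(comment):
    # On create, the comment is stashed in the table properties.
    return {"properties": {"comment": comment}, "comment": None}

def read_table_buggy(stored):
    # Buggy read path: the comment field is never rebuilt from properties.
    return {"comment": stored["comment"]}

def read_table_fixed(stored):
    # Fixed read path: restore the comment from the stored properties.
    return {"comment": stored["properties"].get("comment")}

stored = create_table("user events table")
assert read_table_buggy(stored)["comment"] is None
assert read_table_fixed(stored)["comment"] == "user events table"
```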
[jira] [Resolved] (SPARK-16749) Clean-up OffsetWindowFrame
[ https://issues.apache.org/jira/browse/SPARK-16749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16749. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14376 [https://github.com/apache/spark/pull/14376] > Clean-up OffsetWindowFrame > -- > > Key: SPARK-16749 > URL: https://issues.apache.org/jira/browse/SPARK-16749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > Fix For: 2.1.0 > > > The code in OffsetWindowFrame can be a bit more streamlined and quicker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12909) Spark on Mesos accessing Secured HDFS w/Kerberos
[ https://issues.apache.org/jira/browse/SPARK-12909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412624#comment-15412624 ] Michael Gummelt commented on SPARK-12909: - DC/OS Spark has this functionality, and we'll be upstreaming it to Apache Spark soon. > Spark on Mesos accessing Secured HDFS w/Kerberos > > > Key: SPARK-12909 > URL: https://issues.apache.org/jira/browse/SPARK-12909 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Greg Senia > > Ability for Spark on Mesos to use a Kerberized HDFS FileSystem for data It > seems like this is not possible based on email chains and forum articles? If > these are true how hard would it be to get this implemented I'm willing to > try to help. > https://community.hortonworks.com/questions/5415/spark-on-yarn-vs-mesos.html > https://www.mail-archive.com/user@spark.apache.org/msg31326.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16779) Fix unnecessary use of postfix operations
[ https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16779. Resolution: Fixed Assignee: holdenk Fix Version/s: 2.1.0 > Fix unnecessary use of postfix operations > - > > Key: SPARK-16779 > URL: https://issues.apache.org/jira/browse/SPARK-16779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Assignee: holdenk > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16958) Reuse subqueries within single query
[ https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16958: Assignee: Apache Spark (was: Davies Liu) > Reuse subqueries within single query > > > Key: SPARK-16958 > URL: https://issues.apache.org/jira/browse/SPARK-16958 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > The same subquery may appear multiple times within a single query; we could > reuse the result instead of running it each time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16958) Reuse subqueries within single query
[ https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16958: Assignee: Davies Liu (was: Apache Spark) > Reuse subqueries within single query > > > Key: SPARK-16958 > URL: https://issues.apache.org/jira/browse/SPARK-16958 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > The same subquery may appear multiple times within a single query; we could > reuse the result instead of running it each time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16958) Reuse subqueries within single query
[ https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412578#comment-15412578 ] Apache Spark commented on SPARK-16958: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14548 > Reuse subqueries within single query > > > Key: SPARK-16958 > URL: https://issues.apache.org/jira/browse/SPARK-16958 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > The same subquery may appear multiple times within a single query; we could > reuse the result instead of running it each time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16958) Reuse subqueries within single query
Davies Liu created SPARK-16958: -- Summary: Reuse subqueries within single query Key: SPARK-16958 URL: https://issues.apache.org/jira/browse/SPARK-16958 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Assignee: Davies Liu The same subquery may appear multiple times within a single query; we could reuse the result instead of running it each time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
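The reuse idea in SPARK-16958 boils down to caching by a canonical form of the subquery, so that textually different but semantically identical subqueries execute only once. The sketch below is illustrative only: `run_subquery` is a stand-in for real execution, and the toy `canonical` function is far simpler than Spark's plan canonicalization.

```python
# Illustrative sketch: cache subquery results keyed by a canonical form,
# so duplicate subqueries within one query run only once.
executions = 0

def run_subquery(sql):
    global executions
    executions += 1  # count how many times we actually execute
    return f"result-of:{sql}"

def canonical(sql):
    # Toy canonicalization: normalize whitespace and case. Spark would
    # canonicalize the logical plan instead of the SQL text.
    return " ".join(sql.lower().split())

cache = {}
def execute(sql):
    key = canonical(sql)
    if key not in cache:
        cache[key] = run_subquery(sql)
    return cache[key]

execute("SELECT max(x) FROM t")
execute("select max(x)  from t")  # same subquery, different formatting
assert executions == 1  # executed once, reused the second time
```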
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-11150: --- Assignee: (was: Davies Liu) > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0 >Reporter: Younes > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select from tab where partcol=1 will prune on value 1 > Select from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > Tables are based on parquets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-11150: --- Target Version/s: (was: 2.1.0) > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0 >Reporter: Younes >Assignee: Davies Liu > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select from tab where partcol=1 will prune on value 1 > Select from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > Tables are based on parquets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
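The missing optimization described above (a filter on `dim.partcol` never prunes `tab`'s partitions) amounts to propagating literal filters across equi-join equalities. The following is a toy sketch under assumed data structures, not Spark's actual optimizer rule.

```python
# Toy sketch of dynamic partition pruning's core inference: given
# dim.partcol = tab.partcol and dim.partcol = 1, infer tab.partcol = 1
# so that tab scans only the matching partition.
def infer_pruning_filters(join_keys, filters):
    """join_keys: list of (left_col, right_col) equi-join pairs.
    filters: dict of column -> literal value.
    Returns filters extended across the join equalities to a fixpoint."""
    inferred = dict(filters)
    changed = True
    while changed:
        changed = False
        for a, b in join_keys:
            if a in inferred and b not in inferred:
                inferred[b] = inferred[a]; changed = True
            if b in inferred and a not in inferred:
                inferred[a] = inferred[b]; changed = True
    return inferred

filters = infer_pruning_filters([("dim.partcol", "tab.partcol")],
                                {"dim.partcol": 1})
assert filters["tab.partcol"] == 1  # tab can now prune to partition 1
```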
[jira] [Commented] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412565#comment-15412565 ] Apache Spark commented on SPARK-16718: -- User 'vlad17' has created a pull request for this issue: https://github.com/apache/spark/pull/14547 > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Vladimir Feinberg > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > The commit should include evidence of accuracy improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16718: Assignee: Vladimir Feinberg (was: Apache Spark) > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Vladimir Feinberg > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > The commit should include evidence of accuracy improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16718: Assignee: Apache Spark (was: Vladimir Feinberg) > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Apache Spark > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > The commit should include evidence of accuracy improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
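The "loss-optimal" leaf predictions mentioned in SPARK-16718 have a simple closed form for these losses: for the residuals landing in a leaf, the constant minimizing L2 loss is their mean, and the constant minimizing L1 loss is their median. The sketch below illustrates only that idea; the actual GBM/TreeBoost update also involves the learning rate and, for logistic loss, a Newton-step approximation.

```python
# Hedged sketch of loss-optimal leaf values for L1 and L2 losses.
import statistics

def leaf_prediction(residuals, loss):
    if loss == "l2":
        return statistics.mean(residuals)    # argmin_c sum((r - c)^2)
    if loss == "l1":
        return statistics.median(residuals)  # argmin_c sum(|r - c|)
    raise ValueError(f"unsupported loss: {loss}")

residuals = [1.0, 2.0, 2.0, 10.0]
assert leaf_prediction(residuals, "l2") == 3.75  # mean, pulled by outlier
assert leaf_prediction(residuals, "l1") == 2.0   # median, robust to outlier
```

The contrast on an outlier-heavy leaf is why the L1 variant is attractive: the median prediction ignores the single large residual that skews the mean.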
[jira] [Assigned] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16955: Assignee: (was: Apache Spark) > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) > at >
[jira] [Assigned] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16955: Assignee: Apache Spark > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Apache Spark > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > 
at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) > at >
[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412551#comment-15412551 ] Apache Spark commented on SPARK-16955: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14546 > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. 
on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) >
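The failure mode in SPARK-16955 can be illustrated with a toy version of ordinal resolution: both the GROUP BY and ORDER BY ordinals must be resolved against the original SELECT list, and the reported error ("position '1' exceeds the size of the select list '0'") is what happens when resolution re-runs against a plan whose select list is no longer available. This is an illustrative sketch only, not the Analyzer's actual logic.

```python
# Toy sketch of ordinal resolution for GROUP BY / ORDER BY clauses.
def resolve_ordinal(ordinal, select_list):
    if not (1 <= ordinal <= len(select_list)):
        raise ValueError(
            f"position {ordinal} exceeds the size of "
            f"the select list {len(select_list)}")
    return select_list[ordinal - 1]

select_list = ["a", "count(*)"]
assert resolve_ordinal(1, select_list) == "a"  # GROUP BY 1 -> a
assert resolve_ordinal(1, select_list) == "a"  # ORDER BY 1 -> a

# Resolving against an empty select list reproduces the reported error.
try:
    resolve_ordinal(1, [])
    raised = False
except ValueError:
    raised = True
assert raised
```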
[jira] [Resolved] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12326. --- Resolution: Done > Move GBT implementation from spark.mllib to spark.ml > > > Key: SPARK-12326 > URL: https://issues.apache.org/jira/browse/SPARK-12326 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > Several improvements can be made to gradient boosted trees, but are not > possible without moving the GBT implementation to spark.ml (e.g. > rawPrediction column, feature importance). This Jira is for moving the > current GBT implementation to spark.ml, which will have roughly the following > steps: > 1. Copy the implementation to spark.ml and change spark.ml classes to use > that implementation. Current tests will ensure that the implementations learn > exactly the same models. > 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, > InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since > eventually all tree implementations will reside in spark.ml, the helper > classes should as well. > 3. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12383) Move unit tests for GBT from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12383. --- Resolution: Duplicate > Move unit tests for GBT from spark.mllib to spark.ml > > > Key: SPARK-12383 > URL: https://issues.apache.org/jira/browse/SPARK-12383 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > After the GBT implementation is moved from MLlib to ML, we should move the > unit tests to ML as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12326: -- Priority: Minor (was: Major) > Move GBT implementation from spark.mllib to spark.ml > > > Key: SPARK-12326 > URL: https://issues.apache.org/jira/browse/SPARK-12326 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > Several improvements can be made to gradient boosted trees, but are not > possible without moving the GBT implementation to spark.ml (e.g. > rawPrediction column, feature importance). This Jira is for moving the > current GBT implementation to spark.ml, which will have roughly the following > steps: > 1. Copy the implementation to spark.ml and change spark.ml classes to use > that implementation. Current tests will ensure that the implementations learn > exactly the same models. > 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, > InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since > eventually all tree implementations will reside in spark.ml, the helper > classes should as well. > 3. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12381. --- Resolution: Duplicate > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412476#comment-15412476 ] Seth Hendrickson commented on SPARK-12381: -- I haven't looked at this in a while. Please feel free to take it over. > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412475#comment-15412475 ] Dongjoon Hyun commented on SPARK-16955: --- `ResolveAggregateFunctions` seems to have a bug to drop the ordinals. I'll make a PR after some testing. > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. 
on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) >
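As a sanity check of the semantics the report expects, all four queries from the description run without error in SQLite, used here purely as a stand-in engine to illustrate the intended ordinal behavior; this says nothing about Spark's analyzer, where only the last query fails in 2.0.0.

```python
import sqlite3

# SQLite stand-in: demonstrates that an engine can accept ordinals in
# GROUP BY and ORDER BY together, which is what the report expects of Spark.
conn = sqlite3.connect(":memory:")

queries = [
    "select a from (select 1 as a) tmp order by 1",
    "select a, count(*) from (select 1 as a) tmp group by 1",
    "select a, count(*) from (select 1 as a) tmp group by 1 order by a",
    # The variant that hits the analysis error in Spark 2.0.0:
    "select a, count(*) from (select 1 as a) tmp group by 1 order by 1",
]
for q in queries:
    print(q, "->", conn.execute(q).fetchall())
```

All four return the single row `(1, 1)` (or `(1,)` for the first), so the failing case is purely an analyzer limitation, not an ambiguity in the query.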
[jira] [Commented] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412468#comment-15412468 ] Apache Spark commented on SPARK-11150: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14545 > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0 >Reporter: Younes >Assignee: Davies Liu > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select from tab where partcol=1 will prune on value 1 > Select from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > Tables are based on parquets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
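The idea behind dynamic partition pruning can be made concrete with a small sketch: collect the join keys that survive the dimension-side filter at runtime, then skip fact-table partitions whose partition-column value is not among them. All names and data below are hypothetical; this is an illustration of the technique, not Spark's implementation.

```python
# Hypothetical fact table stored as one partition per partcol value.
fact_partitions = {
    1: [("a", 1), ("b", 1)],
    2: [("c", 2)],
    3: [("d", 3)],
}

# Dimension side after applying the filter dim.partcol = 1.
dim_rows = [{"partcol": 1, "name": "one"}]

# Dynamic pruning: derive the wanted partition keys from the filtered
# dimension side, and scan only the matching fact partitions instead of all.
wanted = {r["partcol"] for r in dim_rows}
scanned = {k: rows for k, rows in fact_partitions.items() if k in wanted}

print(sorted(scanned))  # only partition 1 is read
```

Without this step, the join in the report scans every partition even though `dim.partcol=1` determines the only partition that can match.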
[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16957: -- Priority: Trivial (was: Major) > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Priority: Trivial > > Just like R's gbm, we should be using weighted split points rather than the > actual continuous binned feature values. For instance, in a dataset > containing binary features (that are fed in as continuous ones), our splits > are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some > smoothness qualities, this is asymptotically bad compared to GBM's approach. > The split point should be a weighted split point of the two values of the > "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, > the above split should be at {{0.75}}. > Example: > {code} > +++-+-+ > |feature0|feature1|label|count| > +++-+-+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0|2| > | 0.0| 0.0| 1.0|2| > | 0.0| 1.0| 0.0|7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0|7| > | 1.0| 1.0| 0.0| 18| > +++-+-+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
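The proposed split value can be sketched as below. The weighting shown, each bin's value weighted by the count of the *opposite* bin, is an assumption reverse-engineered from the 30/10 -> 0.75 example in the description; it is not a claim about what R's gbm actually computes.

```python
def weighted_split(v_left, n_left, v_right, n_right):
    """Weighted midpoint between two adjacent bin values.

    Each value is weighted by the count of the opposite bin, so the split
    lands closer to the sparser bin. This reproduces the 0.75 figure from
    the issue description; the exact formula is an assumption, not taken
    from gbm's source.
    """
    return (n_right * v_left + n_left * v_right) / (n_left + n_right)

# 30 samples at x = 0 and 10 samples at x = 1, as in the description:
print(weighted_split(0.0, 30, 1.0, 10))  # 0.75
```

A plain midpoint would put the split at 0.5 regardless of the counts; the weighted form shifts it toward 0.75 as the example requires.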
[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking
[ https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412461#comment-15412461 ] Michael Gummelt commented on SPARK-11638: - [~radekg] > The only advantage we had was using the same configuration inside of the > docker container. You mean you want to run the spark driver in a docker container? Which configuration did you have to change? I can look more into this, but I need a clear "It's easier/better to do X in bridge mode than in host mode". > So with the HTTP API, Spark would still require the heavy libmesos in order > to work with Mesos? No. The HTTP API will remove the libmesos dependency, which is nice. It's not an urgent priority though. > Run Spark on Mesos with bridge networking > - > > Key: SPARK-11638 > URL: https://issues.apache.org/jira/browse/SPARK-11638 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: Radoslaw Gruchalski > Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, > 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch > > > h4. Summary > Provides {{spark.driver.advertisedPort}}, > {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and > {{spark.replClassServer.advertisedPort}} settings to enable running Spark in > Mesos on Docker with Bridge networking. Provides patches for Akka Remote to > enable Spark driver advertisement using alternative host and port. > With these settings, it is possible to run Spark Master in a Docker container > and have the executors running on Mesos talk back correctly to such Master. > The problem is discussed on the Mesos mailing list here: > https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E > h4. 
Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door > In order for the framework to receive orders in the bridged container, Mesos > in the container has to register for offers using the IP address of the > Agent. Offers are sent by Mesos Master to the Docker container running on a > different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} > would advertise itself using the IP address of the container, something like > {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a > different host, it's a different machine. Mesos 0.24.0 introduced two new > properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and > {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's > address to register for offers. This was provided mainly for running Mesos in > Docker on Mesos. > h4. Spark - how does the above relate and what is being addressed here? > Similar to Mesos, out of the box, Spark does not allow to advertise its > services on ports different than bind ports. Consider following scenario: > Spark is running inside a Docker container on Mesos, it's a bridge networking > mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for > the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and > {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to > Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the > container ports. Starting the executors from such container results in > executors not being able to communicate back to the Spark Master. > This happens because of 2 things: > Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} > transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port > different to what it bound to. The settings discussed are here: > https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376. 
> These do not exist in Akka {{2.3.x}}. Spark driver will always advertise > port {{}} as this is the one {{akka-remote}} is bound to. > Any URIs the executors contact the Spark Master on, are prepared by Spark > Master and handed over to executors. These always contain the port number > used by the Master to find the service on. The services are: > - {{spark.broadcast.port}} > - {{spark.fileserver.port}} > - {{spark.replClassServer.port}} > all above ports are by default {{0}} (random assignment) but can be specified > using Spark configuration ( {{-Dspark...port}} ). However, they are limited > in the same way as the {{spark.driver.port}}; in the above example, an > executor should not contact the file server on port {{6677}} but rather on > the respective 31xxx assigned by Mesos. > Spark currently does not allow any of that. > h4. Taking on the problem, step 1: Spark Driver > As mentioned above, Spark
[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412459#comment-15412459 ] Vladimir Feinberg commented on SPARK-12381: --- Yeah, that'd be a good idea. > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16957: -- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-14045) > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg > > Just like R's gbm, we should be using weighted split points rather than the > actual continuous binned feature values. For instance, in a dataset > containing binary features (that are fed in as continuous ones), our splits > are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some > smoothness qualities, this is asymptotically bad compared to GBM's approach. > The split point should be a weighted split point of the two values of the > "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, > the above split should be at {{0.75}}. > Example: > {code} > +++-+-+ > |feature0|feature1|label|count| > +++-+-+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0|2| > | 0.0| 0.0| 1.0|2| > | 0.0| 1.0| 0.0|7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0|7| > | 1.0| 1.0| 0.0| 18| > +++-+-+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16957) Use weighted midpoints for split values.
Vladimir Feinberg created SPARK-16957:
--------------------------------------

             Summary: Use weighted midpoints for split values.
                 Key: SPARK-16957
                 URL: https://issues.apache.org/jira/browse/SPARK-16957
             Project: Spark
          Issue Type: Sub-task
          Components: MLlib
            Reporter: Vladimir Feinberg


Just like R's gbm, we should be using weighted split points rather than the actual continuous binned feature values. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness qualities, this is asymptotically bad compared to GBM's approach. The split point should be a weighted split point of the two values of the "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at {{0.75}}.

Example:

{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412451#comment-15412451 ] Sean Owen commented on SPARK-12381: --- Are you also basically subsuming https://issues.apache.org/jira/browse/SPARK-12383 ? I'd like to mark all of these as duplicates then, since they don't have activity. > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12381: -- Comment: was deleted (was: I will be out of the office until Monday 15 August. I will not have regular access to email during this time, but will respond upon my return. For any urgent enquiries, please contact Frederick Kruger at frederick.kru...@quantium.com.au, or call the office on +61 2 9292 6400. ) > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-11150: --- Affects Version/s: 2.0.0 Target Version/s: 2.1.0 Issue Type: New Feature (was: Bug) > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0 >Reporter: Younes >Assignee: Davies Liu > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select from tab where partcol=1 will prune on value 1 > Select from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > Tables are based on parquets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-11150: -- Assignee: Davies Liu > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Younes >Assignee: Davies Liu > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select from tab where partcol=1 will prune on value 1 > Select from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > Tables are based on parquets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412439#comment-15412439 ] Matthew Carle commented on SPARK-12381: --- I will be out of the office until Monday 15 August. I will not have regular access to email during this time, but will respond upon my return. For any urgent enquiries, please contact Frederick Kruger at frederick.kru...@quantium.com.au, or call the office on +61 2 9292 6400. > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412438#comment-15412438 ] Vladimir Feinberg commented on SPARK-12381: --- [~sethah] Just so we don't clash, I think these two JIRAs are overlapping: SPARK-16728 > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors
[ https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-16953. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14541 [https://github.com/apache/spark/pull/14541] > Make requestTotalExecutors public to be consistent with > requestExecutors/killExecutors > -- > > Key: SPARK-16953 > URL: https://issues.apache.org/jira/browse/SPARK-16953 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16956) Make ApplicationState.MAX_NUM_RETRY configurable
[ https://issues.apache.org/jira/browse/SPARK-16956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16956: Assignee: Josh Rosen (was: Apache Spark) > Make ApplicationState.MAX_NUM_RETRY configurable > > > Key: SPARK-16956 > URL: https://issues.apache.org/jira/browse/SPARK-16956 > Project: Spark > Issue Type: New Feature > Components: Deploy >Reporter: Josh Rosen >Assignee: Josh Rosen > > The {{ApplicationState.MAX_NUM_RETRY}} setting, which controls the maximum > number of back-to-back executor failures that the standalone cluster manager > will tolerate before removing a faulty application, is currently a hardcoded > constant (10), but there are use-cases for making it configurable (TBD in my > PR). We should add a new configuration key to let users customize this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16956) Make ApplicationState.MAX_NUM_RETRY configurable
[ https://issues.apache.org/jira/browse/SPARK-16956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16956: Assignee: Apache Spark (was: Josh Rosen) > Make ApplicationState.MAX_NUM_RETRY configurable > > > Key: SPARK-16956 > URL: https://issues.apache.org/jira/browse/SPARK-16956 > Project: Spark > Issue Type: New Feature > Components: Deploy >Reporter: Josh Rosen >Assignee: Apache Spark > > The {{ApplicationState.MAX_NUM_RETRY}} setting, which controls the maximum > number of back-to-back executor failures that the standalone cluster manager > will tolerate before removing a faulty application, is currently a hardcoded > constant (10), but there are use-cases for making it configurable (TBD in my > PR). We should add a new configuration key to let users customize this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16956) Make ApplicationState.MAX_NUM_RETRY configurable
Josh Rosen created SPARK-16956:
-------------------------------

             Summary: Make ApplicationState.MAX_NUM_RETRY configurable
                 Key: SPARK-16956
                 URL: https://issues.apache.org/jira/browse/SPARK-16956
             Project: Spark
          Issue Type: New Feature
          Components: Deploy
            Reporter: Josh Rosen
            Assignee: Josh Rosen


The {{ApplicationState.MAX_NUM_RETRY}} setting, which controls the maximum number of back-to-back executor failures that the standalone cluster manager will tolerate before removing a faulty application, is currently a hardcoded constant (10), but there are use-cases for making it configurable (TBD in my PR). We should add a new configuration key to let users customize this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
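Since the configuration key is still TBD in the description, the sketch below uses a placeholder name; it only illustrates the shape of the change, replacing the hardcoded constant with a configurable lookup that falls back to the current default.

```python
DEFAULT_MAX_NUM_RETRY = 10  # the current hardcoded constant

def max_num_retry(conf):
    # "spark.deploy.maxExecutorRetries" is a placeholder key; the real
    # name is decided in the PR, not here.
    return int(conf.get("spark.deploy.maxExecutorRetries", DEFAULT_MAX_NUM_RETRY))

def should_remove_application(consecutive_failures, conf):
    # Remove the application once back-to-back executor failures
    # exceed the configured limit.
    return consecutive_failures > max_num_retry(conf)

print(should_remove_application(11, {}))  # True with the default of 10
print(should_remove_application(11, {"spark.deploy.maxExecutorRetries": "20"}))  # False
```

The point of the change is the second call: the same failure count no longer dooms the application once the operator raises the limit.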
[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking
[ https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412293#comment-15412293 ] Radoslaw Gruchalski commented on SPARK-11638: - [~mandoskippy] Yes, there my lack of knowledge regarding the API can be seen. Just read the http://events.linuxfoundation.org/sites/events/files/slides/Mesos_HTTP_API.pdf. Considering that the Mesos scheduler was changed to take advantage, might be the case. Older versions of Mesos would still require native library. > Run Spark on Mesos with bridge networking > - > > Key: SPARK-11638 > URL: https://issues.apache.org/jira/browse/SPARK-11638 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: Radoslaw Gruchalski > Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, > 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch > > > h4. Summary > Provides {{spark.driver.advertisedPort}}, > {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and > {{spark.replClassServer.advertisedPort}} settings to enable running Spark in > Mesos on Docker with Bridge networking. Provides patches for Akka Remote to > enable Spark driver advertisement using alternative host and port. > With these settings, it is possible to run Spark Master in a Docker container > and have the executors running on Mesos talk back correctly to such Master. > The problem is discussed on the Mesos mailing list here: > https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E > h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door > In order for the framework to receive orders in the bridged container, Mesos > in the container has to register for offers using the IP address of the > Agent. Offers are sent by Mesos Master to the Docker container running on a > different host, an Agent. 
Normally, prior to Mesos 0.24.0, {{libprocess}} > would advertise itself using the IP address of the container, something like > {{172.x.x.x}}. Obviously, the Mesos Master can't reach that address; it's a > different host, a different machine. Mesos 0.24.0 introduced two new > properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and > {{LIBPROCESS_ADVERTISE_PORT}}. These allow the container to use the Agent's > address to register for offers. This was provided mainly for running Mesos in > Docker on Mesos. > h4. Spark - how does the above relate and what is being addressed here? > Similar to Mesos, out of the box, Spark does not allow advertising its > services on ports other than the ones it binds to. Consider the following scenario: > Spark is running inside a Docker container on Mesos, in bridge networking > mode. Assume port {{}} for {{spark.driver.port}}, {{6677}} for > {{spark.fileserver.port}}, {{6688}} for {{spark.broadcast.port}} and > {{23456}} for {{spark.replClassServer.port}}. If such a task is posted to > Marathon, Mesos will assign 4 ports in the range {{31000-32000}} mapping to the > container ports. Starting the executors from such a container results in the > executors not being able to communicate back to the Spark Master. > This happens because of two things: > The Spark driver is effectively an {{akka-remote}} system with the {{akka.tcp}} > transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port > different from the one it is bound to. The settings discussed are here: > https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376. > These do not exist in Akka {{2.3.x}}. The Spark driver will always advertise > port {{}} as this is the one {{akka-remote}} is bound to. > Any URIs the executors contact the Spark Master on are prepared by the Spark > Master and handed over to the executors. These always contain the port number > used by the Master to find the service on. 
The services are: > - {{spark.broadcast.port}} > - {{spark.fileserver.port}} > - {{spark.replClassServer.port}} > All of the above ports default to {{0}} (random assignment) but can be specified > using Spark configuration ( {{-Dspark...port}} ). However, they are limited > in the same way as {{spark.driver.port}}; in the above example, an > executor should not contact the file server on port {{6677}} but rather on > the respective 31xxx port assigned by Mesos. > Spark currently does not allow any of that. > h4. Taking on the problem, step 1: Spark Driver > As mentioned above, the Spark Driver is based on {{akka-remote}}. In order to > take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and > {{akka.remote.net.tcp.bind-port}} settings are a must. Spark does not compile > with Akka
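The bind/advertise split described above can be sketched in a few lines. This is an illustrative-only sketch, not Spark or libprocess code; the IP and port values are made up to stand in for an Agent address and a Mesos-assigned 31xxx host port.

```python
import socket

def bind_and_advertise(advertise_ip, advertise_port):
    """Bind to a container-local address, but publish a separately
    configured, host-reachable "advertise" address, mirroring the split
    that LIBPROCESS_ADVERTISE_IP/PORT and the proposed advertisedPort
    settings provide."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # bind address: container-local, random port
    bind_ip, bind_port = sock.getsockname()
    # Without an advertise setting, the service would publish
    # bind_ip:bind_port, which is unreachable from outside the container.
    return sock, (bind_ip, bind_port), (advertise_ip, advertise_port)

sock, bound, advertised = bind_and_advertise("10.0.0.5", 31000)
sock.close()
```

The point of the patches is exactly this decoupling: the socket still binds inside the container, while every URI handed to peers carries the advertised address instead of the bind address.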
[jira] [Updated] (SPARK-16552) Store the Inferred Schemas into External Catalog Tables when Creating Tables
[ https://issues.apache.org/jira/browse/SPARK-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16552: - Labels: release_notes releasenotes (was: ) > Store the Inferred Schemas into External Catalog Tables when Creating Tables > > > Key: SPARK-16552 > URL: https://issues.apache.org/jira/browse/SPARK-16552 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > > Currently, in Spark SQL, the initial creation of a schema can be classified > into two groups. This applies to both Hive tables and Data Source tables: > Group A. Users specify the schema. > Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema > of the SELECT clause. For example, > {noformat} > CREATE TABLE tab STORED AS TEXTFILE > AS SELECT * from input > {noformat} > Case 2 CREATE TABLE: users explicitly specify the schema. For example, > {noformat} > CREATE TABLE jsonTable (_1 string, _2 string) > USING org.apache.spark.sql.json > {noformat} > Group B. Spark SQL infers the schema at runtime. > Case 3 CREATE TABLE. Users do not specify the schema, only the path to the file > location. For example, > {noformat} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json > OPTIONS (path '${tempDir.getCanonicalPath}') > {noformat} > Now, Spark SQL does not store the inferred schema in the external catalog for > the cases in Group B. When users refresh the metadata cache or access the > table for the first time after (re-)starting Spark, Spark SQL will infer the > schema and store the info in the metadata cache to improve the performance > of subsequent metadata requests. However, the runtime schema inference could > cause undesirable schema changes after each reboot of Spark. > It is desirable to store the inferred schema in the external catalog when > creating the table. 
When users intend to refresh the schema, they issue > `REFRESH TABLE`. Spark SQL will infer the schema again based on the > previously specified table location and update/refresh the schema in the > external catalog and metadata cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
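The proposed behavior can be illustrated with a toy catalog. This is a hedged sketch, not Spark's catalog implementation: the functions and the fake schema below are hypothetical stand-ins showing that inference runs once at CREATE time and again only on an explicit REFRESH, never on ordinary lookups.

```python
infer_calls = 0

def infer_schema(path):
    """Pretend schema inference: in reality this would scan files at `path`."""
    global infer_calls
    infer_calls += 1
    return {"a": "string", "b": "int"}

catalog = {}  # stand-in for the external catalog

def create_table(name, path):
    # Persist the inferred schema at CREATE time (the proposed change).
    catalog[name] = {"path": path, "schema": infer_schema(path)}

def get_schema(name):
    # Lookups read the stored schema; no re-inference on restart or access.
    return catalog[name]["schema"]

def refresh_table(name):
    # Only an explicit REFRESH TABLE re-infers and updates the catalog.
    catalog[name]["schema"] = infer_schema(catalog[name]["path"])

create_table("jsonTable", "/tmp/data")
get_schema("jsonTable")
get_schema("jsonTable")
refresh_table("jsonTable")
```

With the pre-change behavior, every restart would add an inference pass, and a change in the underlying files would silently change the schema; here the schema only moves when the user asks for it.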
[jira] [Assigned] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors
[ https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16953: Assignee: Apache Spark (was: Tathagata Das) > Make requestTotalExecutors public to be consistent with > requestExecutors/killExecutors > -- > > Key: SPARK-16953 > URL: https://issues.apache.org/jira/browse/SPARK-16953 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Tathagata Das >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors
[ https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412290#comment-15412290 ] Apache Spark commented on SPARK-16953: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/14541 > Make requestTotalExecutors public to be consistent with > requestExecutors/killExecutors > -- > > Key: SPARK-16953 > URL: https://issues.apache.org/jira/browse/SPARK-16953 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16953) Make requestTotalExecutors public to be consistent with requestExecutors/killExecutors
[ https://issues.apache.org/jira/browse/SPARK-16953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16953: Assignee: Tathagata Das (was: Apache Spark) > Make requestTotalExecutors public to be consistent with > requestExecutors/killExecutors > -- > > Key: SPARK-16953 > URL: https://issues.apache.org/jira/browse/SPARK-16953 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412265#comment-15412265 ] Dongjoon Hyun commented on SPARK-16955: --- Sure! Thank you, [~yhuai]. I'll take a look at this. > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. 
on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) > at >
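The error above becomes easier to read with a simplified model of ordinal resolution. This is an illustrative sketch, not the actual `ResolveOrdinalInOrderByAndGroupBy` rule: the bug pattern it mimics is an ordinal being resolved against a plan whose select list is no longer visible (size 0), exactly the "'1' exceeds the size of the select list '0'" message in the trace.

```python
def resolve_ordinals(positions, select_list):
    """Map 1-based ordinals in GROUP BY / ORDER BY to select-list expressions,
    raising the same kind of error the analyzer reports when an ordinal falls
    outside the visible select list."""
    resolved = []
    for pos in positions:
        if not (1 <= pos <= len(select_list)):
            raise ValueError(
                "Group by position: '%d' exceeds the size of the select list "
                "'%d'." % (pos, len(select_list)))
        resolved.append(select_list[pos - 1])
    return resolved

# GROUP BY 1 against a visible select list resolves fine:
resolve_ordinals([1], ["a", "count(*)"])

# But when ORDER BY 1 is resolved against a rewritten aggregate whose select
# list is empty from the rule's point of view, resolution fails:
try:
    resolve_ordinals([1], [])
except ValueError as e:
    print(e)
```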
[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412263#comment-15412263 ] Yin Huai commented on SPARK-16955: -- [~dongjoon] Will you have time to take a look? > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. 
on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) > at >
[jira] [Created] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
Yin Huai created SPARK-16955: Summary: Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals Key: SPARK-16955 URL: https://issues.apache.org/jira/browse/SPARK-16955 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Yin Huai The following queries work {code} select a from (select 1 as a) tmp order by 1 select a, count(*) from (select 1 as a) tmp group by 1 select a, count(*) from (select 1 as a) tmp group by 1 order by a {code} However, the following query does not {code} select a, count(*) from (select 1 as a) tmp group by 1 order by 1 {code} {code} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to Group by position: '1' exceeds the size of the select list '0'. on unresolved object, tree: Aggregate [1] +- SubqueryAlias tmp +- Project [1 AS a#82] +- OneRowRelation$ at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at
[jira] [Commented] (SPARK-14666) Using DISTINCT on a UDF (like CONCAT) is not supported
[ https://issues.apache.org/jira/browse/SPARK-14666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412250#comment-15412250 ] Dongjoon Hyun commented on SPARK-14666: --- Great! > Using DISTINCT on a UDF (like CONCAT) is not supported > -- > > Key: SPARK-14666 > URL: https://issues.apache.org/jira/browse/SPARK-14666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Dominic Ricard >Priority: Minor > Fix For: 2.0.0 > > > The following query fails with: > {noformat} > Java::JavaSql::SQLException: org.apache.spark.sql.AnalysisException: cannot > resolve 'column_1' given input columns: [_c0]; line # pos ## > {noformat} > Query: > {noformat} > select > distinct concat(column_1, ' : ', column_2) > from > table > order by > concat(column_1, ' : ', column_2); > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14666) Using DISTINCT on a UDF (like CONCAT) is not supported
[ https://issues.apache.org/jira/browse/SPARK-14666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Ricard resolved SPARK-14666. Resolution: Fixed Fix Version/s: 2.0.0 > Using DISTINCT on a UDF (like CONCAT) is not supported > -- > > Key: SPARK-14666 > URL: https://issues.apache.org/jira/browse/SPARK-14666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Dominic Ricard >Priority: Minor > Fix For: 2.0.0 > > > The following query fails with: > {noformat} > Java::JavaSql::SQLException: org.apache.spark.sql.AnalysisException: cannot > resolve 'column_1' given input columns: [_c0]; line # pos ## > {noformat} > Query: > {noformat} > select > distinct concat(column_1, ' : ', column_2) > from > table > order by > concat(column_1, ' : ', column_2); > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14666) Using DISTINCT on a UDF (like CONCAT) is not supported
[ https://issues.apache.org/jira/browse/SPARK-14666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412243#comment-15412243 ] Dominic Ricard commented on SPARK-14666: It does indeed work in Spark 2.0. Thanks. > Using DISTINCT on a UDF (like CONCAT) is not supported > -- > > Key: SPARK-14666 > URL: https://issues.apache.org/jira/browse/SPARK-14666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Dominic Ricard >Priority: Minor > > The following query fails with: > {noformat} > Java::JavaSql::SQLException: org.apache.spark.sql.AnalysisException: cannot > resolve 'column_1' given input columns: [_c0]; line # pos ## > {noformat} > Query: > {noformat} > select > distinct concat(column_1, ' : ', column_2) > from > table > order by > concat(column_1, ' : ', column_2); > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16951) Alternative implementation of NOT IN to Anti-join
[ https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16951: Fix Version/s: (was: 2.1.0) > Alternative implementation of NOT IN to Anti-join > - > > Key: SPARK-16951 > URL: https://issues.apache.org/jira/browse/SPARK-16951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > A transformation currently used to process a {{NOT IN}} subquery is to rewrite > it to a form of Anti-join with a null-aware property in the Logical Plan and then > translate it to a form of {{OR}} predicate joining the parent side and the > subquery side of the {{NOT IN}}. As a result, the presence of the {{OR}} > predicate limits execution to the nested-loop join plan, which will have > a major performance impact if both sides' results are large. > This JIRA sketches an idea of changing the {{OR}} predicate to a form similar to > the technique used in the implementation of the Existence join, which addresses > the problem of {{EXISTS (..) OR ..}} type queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16951) Alternative implementation of NOT IN to Anti-join
[ https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16951: Target Version/s: (was: 2.1.0) > Alternative implementation of NOT IN to Anti-join > - > > Key: SPARK-16951 > URL: https://issues.apache.org/jira/browse/SPARK-16951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > A transformation currently used to process a {{NOT IN}} subquery is to rewrite > it to a form of Anti-join with a null-aware property in the Logical Plan and then > translate it to a form of {{OR}} predicate joining the parent side and the > subquery side of the {{NOT IN}}. As a result, the presence of the {{OR}} > predicate limits execution to the nested-loop join plan, which will have > a major performance impact if both sides' results are large. > This JIRA sketches an idea of changing the {{OR}} predicate to a form similar to > the technique used in the implementation of the Existence join, which addresses > the problem of {{EXISTS (..) OR ..}} type queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
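Why {{NOT IN}} needs the null-aware property at all can be shown with a small sketch. This is not Spark's implementation; the two functions below are hypothetical, and merely contrast a plain anti-join with SQL's three-valued {{NOT IN}} semantics, which is what forces the {{OR}} predicate (and hence the nested-loop join) in the first place.

```python
def plain_anti_join(left, right):
    """Naive anti-join: keep left rows whose key is absent from the right.
    This is NOT correct for SQL NOT IN when nulls are involved."""
    rset = set(right)
    return [x for x in left if x not in rset]

def null_aware_not_in(left, right):
    """SQL semantics: `x NOT IN (S)` is unknown (so the row is filtered)
    when S contains NULL, or when x is NULL and S is non-empty."""
    out = []
    for x in left:
        if x is None and right:
            continue  # NULL NOT IN (non-empty S) is unknown, not true
        if None in right:
            continue  # x NOT IN (S with NULL) is never true
        if x not in right:
            out.append(x)
    return out

left = [1, 2, None]
right = [2, None]
print(plain_anti_join(left, right))    # keeps a row SQL would filter
print(null_aware_not_in(left, right))  # empty: NULL in S filters everything
```

The null checks are exactly what the rewritten {{OR}} predicate encodes; the proposal here is to express them in a form (as in the Existence-join work) that keeps a hash-based join plan viable.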
[jira] [Resolved] (SPARK-16586) spark-class crash with "[: too many arguments" instead of displaying the correct error message
[ https://issues.apache.org/jira/browse/SPARK-16586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16586. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.1.0 2.0.1 > spark-class crash with "[: too many arguments" instead of displaying the > correct error message > -- > > Key: SPARK-16586 > URL: https://issues.apache.org/jira/browse/SPARK-16586 > Project: Spark > Issue Type: Bug >Reporter: Xiang Gao >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > When trying to run Spark on a machine that cannot provide enough memory for > Java to use, instead of printing the correct error message, spark-class will > crash with {{spark-class: line 83: [: too many arguments}} > Simple shell commands to trigger this problem are: > {code} > ulimit -v 10 > ./sbin/start-master.sh > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
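The "[: too many arguments" failure mode is a classic shell quoting bug. The snippet below is an illustrative reproduction, not the actual spark-class line 83: an unquoted variable holding a multi-word message (here a made-up JVM error string) is word-split into separate arguments to the `[` builtin, which then aborts instead of performing the test.

```shell
msg="not enough memory for the Java Runtime Environment"

# Unquoted: the expansion splits into several words, so `[` receives too
# many arguments and fails with a non-zero error status instead of testing.
[ -n $msg ] 2>/dev/null
echo "unquoted status: $?"

# Quoted: the expansion stays a single argument, the intended non-empty test.
[ -n "$msg" ]
echo "quoted status: $?"
```

The fix in spark-class is of this shape: quote the expansion (or restructure the test) so a multi-word value, such as a JVM out-of-memory message captured into a variable, cannot blow up the `[` invocation that was supposed to report it.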
[jira] [Assigned] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone
[ https://issues.apache.org/jira/browse/SPARK-16930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16930: Assignee: (was: Apache Spark) > ApplicationMaster's code that waits for SparkContext is race-prone > -- > > Key: SPARK-16930 > URL: https://issues.apache.org/jira/browse/SPARK-16930 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Priority: Minor > > While taking a look at SPARK-15937 and checking if there's something wrong > with the code, I noticed two races that explain the behavior. > Because they're really narrow races, I'm a little wary of declaring them the > cause of that bug. Also because the logs posted there don't really explain > what went wrong (and don't really look like a SparkContext was run at all). > The races I found are: > - it's possible, but very unlikely, for an application to instantiate a > SparkContext and stop it before the AM enters the loop where it checks for > the instance. > - it's possible, but very unlikely, for an application to stop the > SparkContext after the AM is already waiting for one, has been notified of > its creation, but hasn't yet stored the SparkContext reference in a local > variable. > I'll fix those and clean up the code a bit in the process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone
[ https://issues.apache.org/jira/browse/SPARK-16930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16930: Assignee: Apache Spark > ApplicationMaster's code that waits for SparkContext is race-prone > -- > > Key: SPARK-16930 > URL: https://issues.apache.org/jira/browse/SPARK-16930 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > While taking a look at SPARK-15937 and checking if there's something wrong > with the code, I noticed two races that explain the behavior. > Because they're really narrow races, I'm a little wary of declaring them the > cause of that bug. Also because the logs posted there don't really explain > what went wrong (and don't really look like a SparkContext was run at all). > The races I found are: > - it's possible, but very unlikely, for an application to instantiate a > SparkContext and stop it before the AM enters the loop where it checks for > the instance. > - it's possible, but very unlikely, for an application to stop the > SparkContext after the AM is already waiting for one, has been notified of > its creation, but hasn't yet stored the SparkContext reference in a local > variable. > I'll fix those and clean up the code a bit in the process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone
[ https://issues.apache.org/jira/browse/SPARK-16930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412130#comment-15412130 ] Apache Spark commented on SPARK-16930: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/14542 > ApplicationMaster's code that waits for SparkContext is race-prone > -- > > Key: SPARK-16930 > URL: https://issues.apache.org/jira/browse/SPARK-16930 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Priority: Minor > > While taking a look at SPARK-15937 and checking if there's something wrong > with the code, I noticed two races that explain the behavior. > Because they're really narrow races, I'm a little wary of declaring them the > cause of that bug. Also because the logs posted there don't really explain > what went wrong (and don't really look like a SparkContext was run at all). > The races I found are: > - it's possible, but very unlikely, for an application to instantiate a > SparkContext and stop it before the AM enters the loop where it checks for > the instance. > - it's possible, but very unlikely, for an application to stop the > SparkContext after the AM is already waiting for one, has been notified of > its creation, but hasn't yet stored the SparkContext reference in a local > variable. > I'll fix those and clean up the code a bit in the process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
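The two races described above share a fix pattern: record the context and the terminal state under one lock, and have the waiter read the reference while still holding that lock. The sketch below is hypothetical (a generic Python condition-variable idiom, not the ApplicationMaster's Scala code) and shows why neither "stopped before the wait started" nor "stopped between notify and wake-up" can lose the reference.

```python
import threading

class ContextHolder:
    """Illustrative only: holds a context reference race-free."""

    def __init__(self):
        self._cond = threading.Condition()
        self._ctx = None
        self._done = False  # set by either set() or stop()

    def set(self, ctx):
        with self._cond:
            self._ctx = ctx
            self._done = True
            self._cond.notify_all()

    def stop(self):
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def wait_for_context(self, timeout=1.0):
        with self._cond:
            # Re-check state under the lock: if set()/stop() already ran,
            # we never sleep, so a wakeup cannot be lost.
            if not self._done:
                self._cond.wait(timeout)
            return self._ctx  # read under the same lock as the writes

holder = ContextHolder()
holder.set("ctx")
holder.stop()  # context created and stopped before anyone waited
print(holder.wait_for_context())
```

Reading `self._ctx` inside the same critical section that `set()` and `stop()` use closes the second race: there is no window where the waiter has been notified but reads the reference without the lock.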