[jira] [Deleted] (SPARK-13553) Migrate basic inspection operations
[ https://issues.apache.org/jira/browse/SPARK-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13553: --- > Migrate basic inspection operations > --- > > Key: SPARK-13553 > URL: https://issues.apache.org/jira/browse/SPARK-13553 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Basic inspection operations > - dtypes > - columns > - printSchema > - explain > - Column accessors > - col > - apply > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
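For reference, the inspection operations and column accessors listed above can be sketched as follows. This is a hypothetical example, not code from the ticket; it assumes a running Spark 2.x {{SparkSession}} named {{spark}} and a made-up {{Person}} case class:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

ds.dtypes        // basic inspection: array of (column name, type string) pairs
ds.columns       // column names
ds.printSchema() // prints the schema tree to stdout
ds.explain()     // prints the physical plan
ds.col("age")    // column accessor: col
ds("age")        // column accessor: apply
```

The point of the migration is that all of these, previously DataFrame-only, become available on any {{Dataset[T]}}.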
[jira] [Deleted] (SPARK-13555) Migrate untyped relational operations
[ https://issues.apache.org/jira/browse/SPARK-13555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13555: --- > Migrate untyped relational operations > - > > Key: SPARK-13555 > URL: https://issues.apache.org/jira/browse/SPARK-13555 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Relational operations > - Untyped relational operations > - select(Column*): Dataset[Row] > - select(String, String*): Dataset[Row] > - selectExpr(String*): Dataset[Row] > {noformat}
[jira] [Deleted] (SPARK-13556) Migrate untyped joins
[ https://issues.apache.org/jira/browse/SPARK-13556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13556: --- > Migrate untyped joins > - > > Key: SPARK-13556 > URL: https://issues.apache.org/jira/browse/SPARK-13556 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Joins > - Untyped joins > - join[U: Encoder](Dataset[U]): Dataset[Row] > - join[U: Encoder](Dataset[U], String): Dataset[Row] > - join[U: Encoder](Dataset[U], Seq[String]): Dataset[Row] > - join[U: Encoder](Dataset[U], Seq[String], String): Dataset[Row] > - join[U: Encoder](Dataset[U], Column): Dataset[Row] > - join[U: Encoder](Dataset[U], Column, String): Dataset[Row] > {noformat}
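The untyped join overloads listed above roughly correspond to the following usage. A hypothetical sketch, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

left.join(right, "id")                        // join on a single column name
left.join(right, Seq("id"))                   // join on a list of column names
left.join(right, Seq("id"), "left_outer")     // ... with an explicit join type
left.join(right, left("id") === right("id"))  // join with a Column condition
```

Each variant returns an untyped {{Dataset[Row]}} (i.e. a DataFrame), regardless of the element types of the inputs.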
[jira] [Deleted] (SPARK-13557) Migrate gather-to-driver actions
[ https://issues.apache.org/jira/browse/SPARK-13557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13557: --- > Migrate gather-to-driver actions > > > Key: SPARK-13557 > URL: https://issues.apache.org/jira/browse/SPARK-13557 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Gather-to-driver actions > - head(Int): Array[T] > - head(): T > - first(): T > - collect(): Array[T] > - collectAsList(): java.util.List[T] > - take(Int): Array[T] > - takeAsList(Int): java.util.List[T] > {noformat}
[jira] [Deleted] (SPARK-13558) Migrate basic GroupedDataset methods
[ https://issues.apache.org/jira/browse/SPARK-13558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13558: --- > Migrate basic GroupedDataset methods > > > Key: SPARK-13558 > URL: https://issues.apache.org/jira/browse/SPARK-13558 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian >Assignee: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - GroupedDataset > - Support GroupType (GroupBy/GroupingSet/Rollup/Cube) > - Untyped aggregations > - agg((String, String), (String, String)*): Dataset[Row] > - agg(Map[String, String]): Dataset[Row] > - agg(java.util.Map[String, String]): Dataset[Row] > - agg(Column, Column*): Dataset[Row] > {noformat}
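The untyped {{agg}} overloads above can be sketched as follows. A hypothetical example, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import org.apache.spark.sql.functions.{avg, max}
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

df.groupBy("key").agg("value" -> "max")              // agg((String, String), ...)
df.groupBy("key").agg(Map("value" -> "sum"))         // agg(Map[String, String])
df.groupBy("key").agg(max($"value"), avg($"value"))  // agg(Column, Column*)
```

All three forms produce an untyped {{Dataset[Row]}} with one row per group.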
[jira] [Deleted] (SPARK-13559) Migrate common GroupedDataset aggregations
[ https://issues.apache.org/jira/browse/SPARK-13559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13559: --- > Migrate common GroupedDataset aggregations > -- > > Key: SPARK-13559 > URL: https://issues.apache.org/jira/browse/SPARK-13559 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - GroupedDataset > - Common untyped aggregations > - mean(String*): Dataset[Row] > - max(String*): Dataset[Row] > - avg(String*): Dataset[Row] > - min(String*): Dataset[Row] > - sum(String*): Dataset[Row] > - Common typed aggregations > - count(): Dataset[(K, Long)] > {noformat}
[jira] [Deleted] (SPARK-13560) Migrate GroupedDataset pivoting methods
[ https://issues.apache.org/jira/browse/SPARK-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13560: --- > Migrate GroupedDataset pivoting methods > --- > > Key: SPARK-13560 > URL: https://issues.apache.org/jira/browse/SPARK-13560 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - GroupedDataset > - Pivoting > - pivot(String): GroupedDataset[Row, V] > - pivot(String, Seq[Any]): GroupedDataset[Row, V] > - pivot(String, java.util.List[Any]): GroupedDataset[Row, V] > {noformat}
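The pivoting overloads above can be sketched as follows. A hypothetical example, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

val df = Seq(("2015", "java", 100), ("2015", "scala", 200), ("2016", "scala", 300))
  .toDF("year", "lang", "sales")

// pivot(String): distinct pivot values are inferred from the data
df.groupBy("year").pivot("lang").sum("sales")

// pivot(String, Seq[Any]): pivot values given explicitly, which avoids an
// extra pass over the data to collect them
df.groupBy("year").pivot("lang", Seq("java", "scala")).sum("sales")
```

Passing the pivot values explicitly is the cheaper form, since the inferred variant must first compute the distinct values of the pivot column.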
[jira] [Created] (SPARK-13817) Re-enable MiMA check after unifying DataFrame and Dataset API
Cheng Lian created SPARK-13817: -- Summary: Re-enable MiMA check after unifying DataFrame and Dataset API Key: SPARK-13817 URL: https://issues.apache.org/jira/browse/SPARK-13817 Project: Spark Issue Type: Test Components: Build Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian In [PR #11443|https://github.com/apache/spark/pull/11443], we unified the DataFrame and Dataset APIs. Since this PR made a large number of API changes, we temporarily disabled the MiMA check for convenience. Now that it is merged, we should re-enable the MiMA check.
[jira] [Deleted] (SPARK-13564) Migrate DataFrameStatFunctions to Dataset
[ https://issues.apache.org/jira/browse/SPARK-13564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13564: --- > Migrate DataFrameStatFunctions to Dataset > - > > Key: SPARK-13564 > URL: https://issues.apache.org/jira/browse/SPARK-13564 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > After the migration, we should have a separate namespace {{Dataset.stat}} for > statistics methods, just like {{DataFrame.stat}}.
[jira] [Deleted] (SPARK-13563) Migrate DataFrameNaFunctions to Dataset
[ https://issues.apache.org/jira/browse/SPARK-13563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13563: --- > Migrate DataFrameNaFunctions to Dataset > --- > > Key: SPARK-13563 > URL: https://issues.apache.org/jira/browse/SPARK-13563 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > After the migration, we should have a separate namespace {{Dataset.na}}, just > like {{DataFrame.na}}.
[jira] [Deleted] (SPARK-13562) Migrate Dataset typed aggregations
[ https://issues.apache.org/jira/browse/SPARK-13562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13562: --- > Migrate Dataset typed aggregations > -- > > Key: SPARK-13562 > URL: https://issues.apache.org/jira/browse/SPARK-13562 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - Untyped aggregations (depends on GroupedDataset) > - groupBy(Column*): GroupedDataset[Row, T] > - groupBy(String, String*): GroupedDataset[Row, T] > - rollup(Column*): GroupedDataset[Row, T] > - rollup(String, String*): GroupedDataset[Row, T] > - cube(Column*): GroupedDataset[Row, T] > - cube(String, String*): GroupedDataset[Row, T] > - agg((String, String), (String, String)*): Dataset[Row] > - agg(Map[String, String]): Dataset[Row] > - agg(java.util.Map[String, String]): Dataset[Row] > - agg(Column, Column*): Dataset[Row] > {noformat}
[jira] [Deleted] (SPARK-13561) Migrate Dataset untyped aggregations
[ https://issues.apache.org/jira/browse/SPARK-13561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13561: --- > Migrate Dataset untyped aggregations > > > Key: SPARK-13561 > URL: https://issues.apache.org/jira/browse/SPARK-13561 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > Should migrate the following methods and corresponding tests to Dataset: > {noformat} > - Aggregations > - Typed aggregations (depends on GroupedDataset) > - groupBy[K: Encoder](T => K): GroupedDataset[K, T] // rename to > groupByKey > - groupBy[K](MapFunction[T, K], Encoder[K]): GroupedDataset[K, T] // > Rename to groupByKey > - count > {noformat}
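The typed grouping listed above (renamed to {{groupByKey}} during the unification, as the inline comments note) can be sketched as follows. A hypothetical example, not code from the ticket; assumes a running Spark 2.x {{SparkSession}} named {{spark}}:

```scala
// Sketch only: assumes a SparkSession named `spark` is in scope (Spark 2.x).
import spark.implicits._

val words = Seq("apple", "avocado", "banana").toDS()

// Typed grouping on a key function (formerly groupBy[K: Encoder](T => K)):
val grouped = words.groupByKey(w => w.substring(0, 1))

// Typed aggregation: one (key, count) pair per group.
grouped.count()  // Dataset[(String, Long)]
```

Unlike the untyped {{groupBy(Column*)}} variants, the result here keeps a typed key, so downstream operations stay in the typed Dataset world.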
[jira] [Deleted] (SPARK-13565) Migrate DataFrameReader/DataFrameWriter to Dataset API
[ https://issues.apache.org/jira/browse/SPARK-13565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian deleted SPARK-13565: --- > Migrate DataFrameReader/DataFrameWriter to Dataset API > -- > > Key: SPARK-13565 > URL: https://issues.apache.org/jira/browse/SPARK-13565 > Project: Spark > Issue Type: Sub-task >Reporter: Cheng Lian > > We'd like to be able to read/write a Dataset from/to specific data sources. > After the migration, we should have {{Dataset.read}}/{{Dataset.write}}, just > like {{DataFrame.read}}/{{DataFrame.write}}.
[jira] [Created] (SPARK-13822) Follow-ups of DataFrame/Dataset API unification
Cheng Lian created SPARK-13822: -- Summary: Follow-ups of DataFrame/Dataset API unification Key: SPARK-13822 URL: https://issues.apache.org/jira/browse/SPARK-13822 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian This is an umbrella ticket for all follow-up work of DataFrame/Dataset API unification (SPARK-13244).
[jira] [Updated] (SPARK-13817) Re-enable MiMA check after unifying DataFrame and Dataset API
[ https://issues.apache.org/jira/browse/SPARK-13817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13817: --- Issue Type: Sub-task (was: Test) Parent: SPARK-13822 > Re-enable MiMA check after unifying DataFrame and Dataset API > - > > Key: SPARK-13817 > URL: https://issues.apache.org/jira/browse/SPARK-13817 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > In [PR #11443|https://github.com/apache/spark/pull/11443], we unified the > DataFrame and Dataset APIs. Since this PR made a large number of API changes, we > temporarily disabled the MiMA check for convenience. Now that it is merged, we should > re-enable the MiMA check.
[jira] [Created] (SPARK-13826) Revise ScalaDoc of the new Dataset API
Cheng Lian created SPARK-13826: -- Summary: Revise ScalaDoc of the new Dataset API Key: SPARK-13826 URL: https://issues.apache.org/jira/browse/SPARK-13826 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Many DataFrame operations were migrated to Dataset in SPARK-13244. We should revise the ScalaDoc of these APIs. The following things should be updated: - {{@since}} tag - {{@group}} tag - Example code
[jira] [Resolved] (SPARK-13817) Re-enable MiMA check after unifying DataFrame and Dataset API
[ https://issues.apache.org/jira/browse/SPARK-13817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13817. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11656 [https://github.com/apache/spark/pull/11656] > Re-enable MiMA check after unifying DataFrame and Dataset API > - > > Key: SPARK-13817 > URL: https://issues.apache.org/jira/browse/SPARK-13817 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > In [PR #11443|https://github.com/apache/spark/pull/11443], we unified the > DataFrame and Dataset APIs. Since this PR made a large number of API changes, we > temporarily disabled the MiMA check for convenience. Now that it is merged, we should > re-enable the MiMA check.
[jira] [Created] (SPARK-13828) QueryExecution's assertAnalyzed needs to preserve the stacktrace
Cheng Lian created SPARK-13828: -- Summary: QueryExecution's assertAnalyzed needs to preserve the stacktrace Key: SPARK-13828 URL: https://issues.apache.org/jira/browse/SPARK-13828 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian SPARK-13244 made Dataset analysis always eager, and added an extra {{plan}} argument to {{AnalysisException}} to facilitate logical plan analysis debugging using {{QueryExecution.assertAnalyzed}}. (Previously we used to temporarily disable DataFrame eager analysis to report the partially analyzed plan tree.) However, the exception stack trace wasn't properly preserved. It should be added back.
[jira] [Created] (SPARK-13841) Remove Dataset.collectRows() and Dataset.takeRows()
Cheng Lian created SPARK-13841: -- Summary: Remove Dataset.collectRows() and Dataset.takeRows() Key: SPARK-13841 URL: https://issues.apache.org/jira/browse/SPARK-13841 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian These two methods were added because the original {{DataFrame.collect()}} and {{DataFrame.take()}} methods became methods of {{Dataset\[T\]}} after merging the DataFrame and Dataset APIs. However, Java doesn't allow returning generic arrays, and type erasure thus turned their return type into {{Object}}, which broke compilation of Java code. After discussion, we decided to simply resort to the existing {{collectAsList()}} and {{takeAsList()}} methods and remove these two extra specialized ones.
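The erasure problem described above can be illustrated outside Spark with a small standalone Scala sketch ({{Box}} is hypothetical, not one of Spark's classes): because the element type {{T}} is unbounded, {{Array[T]}} erases to {{java.lang.Object}} in the bytecode, which is exactly the unusable return type Java callers would see on {{collect()}}:

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for a generic container with a collect(): Array[T] method.
class Box[T: ClassTag](elems: Seq[T]) {
  def collect(): Array[T] = elems.toArray
}

// From Java's point of view, the erased return type of collect() is Object,
// not an array type, so Java code calling it fails to compile as expected:
val returnType = classOf[Box[_]].getMethods
  .find(_.getName == "collect").get.getReturnType

println(returnType) // class java.lang.Object
```

This is why {{collectAsList(): java.util.List[T]}}, whose erased return type is still {{java.util.List}}, remains Java-friendly while {{collect(): Array[T]}} is not.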
[jira] [Resolved] (SPARK-13841) Remove Dataset.collectRows() and Dataset.takeRows()
[ https://issues.apache.org/jira/browse/SPARK-13841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13841. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11678 [https://github.com/apache/spark/pull/11678] > Remove Dataset.collectRows() and Dataset.takeRows() > --- > > Key: SPARK-13841 > URL: https://issues.apache.org/jira/browse/SPARK-13841 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > These two methods were added because the original {{DataFrame.collect()}} and > {{DataFrame.take()}} methods became methods of {{Dataset\[T\]}} after merging > the DataFrame and Dataset APIs. However, Java doesn't allow returning generic > arrays, and type erasure thus turned their return type into {{Object}}, which broke > compilation of Java code. After discussion, we decided to simply resort to the > existing {{collectAsList()}} and {{takeAsList()}} methods and remove these > two extra specialized ones.
[jira] [Updated] (SPARK-12718) SQL generation support for window functions
[ https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12718: --- Assignee: Wenchen Fan (was: Xiao Li) > SQL generation support for window functions > --- > > Key: SPARK-12718 > URL: https://issues.apache.org/jira/browse/SPARK-12718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan > > {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can > be useful for bootstrapping test coverage. Please refer to SPARK-11012 for > more details.
[jira] [Created] (SPARK-13910) Should provide a factory method for constructing DataFrames using unresolved logical plan
Cheng Lian created SPARK-13910: -- Summary: Should provide a factory method for constructing DataFrames using unresolved logical plan Key: SPARK-13910 URL: https://issues.apache.org/jira/browse/SPARK-13910 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Before DataFrame and Dataset were merged, there was a public DataFrame constructor that accepted an unresolved logical plan. Now this constructor is gone, replaced by {{Dataset.newDataFrame}}, but {{object Dataset}} is marked as {{private\[sql\]}}. We should make this method public.
[jira] [Created] (SPARK-13911) Having condition and order by cannot both have aggregate functions
Cheng Lian created SPARK-13911: -- Summary: Having condition and order by cannot both have aggregate functions Key: SPARK-13911 URL: https://issues.apache.org/jira/browse/SPARK-13911 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1, 1.5.2, 1.4.1, 1.3.1, 2.0.0 Reporter: Cheng Lian Given the following temporary table: {code} sqlContext range 10 select ('id as 'a, 'id as 'b) registerTempTable "t" {code} The following SQL statement can't pass analysis: {noformat} scala> sqlContext sql "SELECT * FROM t GROUP BY a HAVING COUNT(b) > 0 ORDER BY COUNT(b)" show () org.apache.spark.sql.AnalysisException: expression '`t`.`b`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:36) at org.apache.spark.sql.Dataset$.newDataFrame(Dataset.scala:58) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:784) ... 49 elided {noformat} The reason is that analysis rule {{ResolveAggregateFunctions}} only handles the first {{Filter}} _or_ {{Sort}} directly above an {{Aggregate}}.
[jira] [Commented] (SPARK-13911) Having condition and order by cannot both have aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-13911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195821#comment-15195821 ] Cheng Lian commented on SPARK-13911: Another problem related to the {{ResolveAggregateFunctions}} rule is that it invokes the analyzer recursively, which is pretty tricky to understand and maintain. Here is a possible fix that both fixes the above issue and removes the recursive invocation. Considering having condition and order by over aggregations are actually tightly coupled with aggregation during resolution, we probably shouldn't view them as separate constructs while resolving them. One possible fix of this issue is to introduce a new unresolved logical plan node {{UnresolvedAggregate}}: {code} case class UnresolvedAggregate( child: LogicalPlan, groupingExpressions: Seq[Expression], aggregateExpressions: Seq[NamedExpression], havingCondition: Option[Expression] = None, order: Seq[SortOrder] = Nil ) extends UnaryLogicalPlan {code} The major difference between {{UnresolvedAggregate}} and {{Aggregate}} is that it also contains an optional having condition and a list of sort orders. In other words, it's a filtered, ordered aggregate operator. 
Then, we can have two simple rules to merge all adjacent {{Sort}} and {{Filter}} operators directly above an {{UnresolvedAggregate}}: {code} object MergeHavingConditions extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Filter(condition, (agg: UnresolvedAggregate)) => // Combines all having conditions val combinedCondition = (agg.havingCondition.toSeq :+ condition).reduce(And) agg.copy(havingCondition = Some(combinedCondition)) } } object MergeSortsOverAggregates extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Sort(order, _, (agg: UnresolvedAggregate)) => // Only preserves the last sort order agg.copy(order = order) } } {code} (Of course, we also need to make the Dataset API and {{GlobalAggregates}} produce {{UnresolvedAggregate}} instead of {{Aggregate}}.) Finally, we only need to resolve {{UnresolvedAggregate}} into {{Aggregate}} with optional {{Filter}} and {{Sort}}, which is relatively straightforward. Now, we no longer need to invoke the analyzer recursively in {{ResolveAggregateFunctions}} to resolve aggregate functions appearing in having and order by clauses, since they are already merged into {{UnresolvedAggregate}} and can be resolved all together with grouping expressions and aggregate expressions. 
cc yhuai > Having condition and order by cannot both have aggregate functions > -- > > Key: SPARK-13911 > URL: https://issues.apache.org/jira/browse/SPARK-13911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.1, 2.0.0 >Reporter: Cheng Lian > > Given the following temporary table: > {code} > sqlContext range 10 select ('id as 'a, 'id as 'b) registerTempTable "t" > {code} > The following SQL statement can't pass analysis: > {noformat} > scala> sqlContext sql "SELECT * FROM t GROUP BY a HAVING COUNT(b) > 0 ORDER > BY COUNT(b)" show () > org.apache.spark.sql.AnalysisException: expression '`t`.`b`' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:36) > at org.apache.spark.sql.Dataset$.newDataFrame(Dataset.scala:58) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:784) > ... 49 elided > {noformat} > The reason is that analysis rule {{ResolveAggregateFunctions}} only handles > the first {{Filter}} _or_ {{Sort}} directly above an {{Aggregate}}.
[jira] [Comment Edited] (SPARK-13911) Having condition and order by cannot both have aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-13911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195821#comment-15195821 ] Cheng Lian edited comment on SPARK-13911 at 3/15/16 6:15 PM: - Another problem related to the {{ResolveAggregateFunctions}} rule is that it invokes the analyzer recursively, which is pretty tricky to understand and maintain. Here is a possible fix that both fixes the above issue and removes the recursive invocation. Considering having condition and order by over aggregations are actually tightly coupled with aggregation during resolution, we probably shouldn't view them as separate constructs while resolving them. One possible fix of this issue is to introduce a new unresolved logical plan node {{UnresolvedAggregate}}: {code} case class UnresolvedAggregate( child: LogicalPlan, groupingExpressions: Seq[Expression], aggregateExpressions: Seq[NamedExpression], havingCondition: Option[Expression] = None, order: Seq[SortOrder] = Nil ) extends UnaryLogicalPlan {code} The major difference between {{UnresolvedAggregate}} and {{Aggregate}} is that it also contains an optional having condition and a list of sort orders. In other words, it's a filtered, ordered aggregate operator. 
Then, we can have two simple rules to merge all adjacent {{Sort}} and {{Filter}} operators directly above an {{UnresolvedAggregate}}: {code} object MergeHavingConditions extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Filter(condition, (agg: UnresolvedAggregate)) => // Combines all having conditions val combinedCondition = (agg.havingCondition.toSeq :+ condition).reduce(And) agg.copy(havingCondition = Some(combinedCondition)) } } object MergeSortsOverAggregates extends Rule[LogicalPlan] { override def apply(tree: LogicalPlan): LogicalPlan = tree transformDown { case Sort(order, _, (agg: UnresolvedAggregate)) => // Only preserves the last sort order agg.copy(order = order) } } {code} (Of course, we also need to make the Dataset API and {{GlobalAggregates}} produce {{UnresolvedAggregate}} instead of {{Aggregate}}.) Finally, we only need to resolve {{UnresolvedAggregate}} into {{Aggregate}} with optional {{Filter}} and {{Sort}}, which is relatively straightforward. Now, we no longer need to invoke the analyzer recursively in {{ResolveAggregateFunctions}} to resolve aggregate functions appearing in having and order by clauses, since they are already merged into {{UnresolvedAggregate}} and can be resolved all together with grouping expressions and aggregate expressions. cc [~yhuai]
[jira] [Resolved] (SPARK-13972) hive tests should fail if SQL generation failed
[ https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13972. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11782 [https://github.com/apache/spark/pull/11782] > hive tests should fail if SQL generation failed > --- > > Key: SPARK-13972 > URL: https://issues.apache.org/jira/browse/SPARK-13972 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Resolved] (SPARK-14001) support multi-children Union in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14001. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11818 [https://github.com/apache/spark/pull/11818] > support multi-children Union in SQLBuilder > -- > > Key: SPARK-14001 > URL: https://issues.apache.org/jira/browse/SPARK-14001 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Created] (SPARK-14002) SQLBuilder should add subquery to Aggregate child when necessary
Cheng Lian created SPARK-14002: -- Summary: SQLBuilder should add subquery to Aggregate child when necessary Key: SPARK-14002 URL: https://issues.apache.org/jira/browse/SPARK-14002 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian

Adding the following test case to {{LogicalPlanToSQLSuite}} reproduces this issue:
{code}
test("bug") {
  checkHiveQl(
    """SELECT COUNT(id)
      |FROM
      |(
      |  SELECT id FROM t0
      |) subq
    """.stripMargin
  )
}
{code}
The generated (wrong) SQL is:
{code:sql}
SELECT `gen_attr_46` AS `count(id)`
FROM
(
  SELECT count(`gen_attr_45`) AS `gen_attr_46`
  FROM
    SELECT `gen_attr_45`          --
    FROM                          --
    (                             -- A subquery
      SELECT `id` AS `gen_attr_45`-- is missing
      FROM `default`.`t0`         --
    ) AS gen_subquery_0           --
) AS gen_subquery_1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
Cheng Lian created SPARK-14004: -- Summary: AttributeReference and Alias should only use their first qualifier to build SQL representations Key: SPARK-14004 URL: https://issues.apache.org/jira/browse/SPARK-14004 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Current implementation joins all qualifiers, which is wrong. However, this doesn't cause any real SQL generation bugs as there is always at most one qualifier for any given {{AttributeReference}} or {{Alias}}. We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
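The proposed change could look like this self-contained sketch (a hypothetical {{QualifiedName}} model; the real {{AttributeReference}} and {{Alias}} classes carry qualifiers and build SQL differently):
{code}
// Only the first qualifier contributes to the SQL representation;
// any extra qualifiers are ignored instead of being joined together.
case class QualifiedName(name: String, qualifiers: Seq[String]) {
  def sql: String =
    (qualifiers.headOption.toSeq :+ name).map(part => s"`$part`").mkString(".")
}

QualifiedName("id", Seq("t0")).sql      // "`t0`.`id`"
QualifiedName("id", Seq("a", "b")).sql  // "`a`.`id`", not "`a`.`b`.`id`"
{code}
Representing qualifiers as {{Option\[String\]}} would make this at-most-one invariant explicit in the type.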
[jira] [Updated] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
[ https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14004: --- Priority: Minor (was: Major) > AttributeReference and Alias should only use their first qualifier to build > SQL representations > --- > > Key: SPARK-14004 > URL: https://issues.apache.org/jira/browse/SPARK-14004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > Current implementation joins all qualifiers, which is wrong. > However, this doesn't cause any real SQL generation bugs as there is always > at most one qualifier for any given {{AttributeReference}} or {{Alias}}. > We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to > represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13974) sub-query names do not need to be globally unique while generate SQL
[ https://issues.apache.org/jira/browse/SPARK-13974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13974: --- Assignee: Wenchen Fan > sub-query names do not need to be globally unique while generate SQL > > > Key: SPARK-13974 > URL: https://issues.apache.org/jira/browse/SPARK-13974 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14002) SQLBuilder should add subquery to Aggregate child when necessary
[ https://issues.apache.org/jira/browse/SPARK-14002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14002. Resolution: Duplicate Fix Version/s: 2.0.0 This issue is actually covered by SPARK-13976. > SQLBuilder should add subquery to Aggregate child when necessary > > > Key: SPARK-14002 > URL: https://issues.apache.org/jira/browse/SPARK-14002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > Adding the following test case to {{LogicalPlanToSQLSuite}} to reproduce this > issue: > {code} > test("bug") { > checkHiveQl( > """SELECT COUNT(id) > |FROM > |( > | SELECT id FROM t0 > |) subq > """.stripMargin > ) > } > {code} > Generated wrong SQL is: > {code:sql} > SELECT `gen_attr_46` AS `count(id)` > FROM > ( > SELECT count(`gen_attr_45`) AS `gen_attr_46` > FROM > SELECT `gen_attr_45`-- > FROM-- > ( -- A subquery > SELECT `id` AS `gen_attr_45`-- is missing > FROM `default`.`t0` -- > ) AS gen_subquery_0 -- > ) AS gen_subquery_1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)
[ https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12719: --- Assignee: Wenchen Fan > SQL generation support for generators (including UDTF) > -- > > Key: SPARK-12719 > URL: https://issues.apache.org/jira/browse/SPARK-12719 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14001) support multi-children Union in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14001: --- Assignee: Wenchen Fan > support multi-children Union in SQLBuilder > -- > > Key: SPARK-14001 > URL: https://issues.apache.org/jira/browse/SPARK-14001 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13972) hive tests should fail if SQL generation failed
[ https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13972: --- Assignee: Wenchen Fan > hive tests should fail if SQL generation failed > --- > > Key: SPARK-13972 > URL: https://issues.apache.org/jira/browse/SPARK-13972 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
[ https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14004. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11820 [https://github.com/apache/spark/pull/11820] > AttributeReference and Alias should only use their first qualifier to build > SQL representations > --- > > Key: SPARK-14004 > URL: https://issues.apache.org/jira/browse/SPARK-14004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 2.0.0 > > > Current implementation joins all qualifiers, which is wrong. > However, this doesn't cause any real SQL generation bugs as there is always > at most one qualifier for any given {{AttributeReference}} or {{Alias}}. > We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to > represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14000) case class with a tuple field can't work in Dataset
[ https://issues.apache.org/jira/browse/SPARK-14000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14000. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11816 [https://github.com/apache/spark/pull/11816] > case class with a tuple field can't work in Dataset > --- > > Key: SPARK-14000 > URL: https://issues.apache.org/jira/browse/SPARK-14000 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Wenchen Fan > Fix For: 2.0.0 > > > For example, given `case class TupleClass(data: (Int, String))`, we can create an > encoder for it, but when we create a Dataset with it, we fail while > validating the encoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
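A hedged repro sketch (requires a Spark session of that era, so it is not runnable standalone; {{Encoders.product}} and {{SQLContext.createDataset}} are the public API entry points assumed here):
{code}
case class TupleClass(data: (Int, String))

// Deriving an encoder for the case class succeeds...
val encoder = org.apache.spark.sql.Encoders.product[TupleClass]

// ...but constructing a Dataset from it reportedly failed during
// encoder validation before this fix.
val ds = sqlContext.createDataset(Seq(TupleClass((1, "a"))))(encoder)
{code}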
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Assignee: Wenchen Fan > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Labels: releasenotes (was: ) > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: releasenotes > > Release note update: > {quote} > Starting from 2.0.0, Spark SQL handles views natively by default. When > defining a view, now Spark SQL canonicalizes view definition by generating a > canonical SQL statement from the parsed logical query plan, and then stores > it into the catalog. If you hit any problems, you may try to turn off native > view by setting {{spark.sql.nativeView}} to false. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Description: Release note update: {quote} Starting from 2.0.0, Spark SQL handles views natively by default. When defining a view, now Spark SQL canonicalizes view definition by generating a canonical SQL statement from the parsed logical query plan, and then stores it into the catalog. If you hit any problems, you may try to turn off native view by setting {{spark.sql.nativeView}} to false. {quote} > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: releasenotes > > Release note update: > {quote} > Starting from 2.0.0, Spark SQL handles views natively by default. When > defining a view, now Spark SQL canonicalizes view definition by generating a > canonical SQL statement from the parsed logical query plan, and then stores > it into the catalog. If you hit any problems, you may try to turn off native > view by setting {{spark.sql.nativeView}} to false. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13774) IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
[ https://issues.apache.org/jira/browse/SPARK-13774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13774: --- Assignee: Sunitha Kambhampati > IllegalArgumentException: Can not create a Path from an empty string for > incorrect file path > > > Key: SPARK-13774 > URL: https://issues.apache.org/jira/browse/SPARK-13774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Sunitha Kambhampati >Priority: Minor > > Think the error message should be improved for files that could not be found. > The {{Path}} seems given. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") > java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:134) > at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.take(RDD.scala:1246) > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.first(RDD.scala:1285) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.findFirstLine(DefaultSource.scala:156) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:58) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at scala.Option.orElse(Option.scala:289) > at > 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:212) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13774) IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
[ https://issues.apache.org/jira/browse/SPARK-13774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13774. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11775 [https://github.com/apache/spark/pull/11775] > IllegalArgumentException: Can not create a Path from an empty string for > incorrect file path > > > Key: SPARK-13774 > URL: https://issues.apache.org/jira/browse/SPARK-13774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Sunitha Kambhampati >Priority: Minor > Fix For: 2.0.0 > > > Think the error message should be improved for files that could not be found. > The {{Path}} seems given. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") > java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:134) > at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.take(RDD.scala:1246) > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.first(RDD.scala:1285) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.findFirstLine(DefaultSource.scala:156) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:58) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:212) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spar
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Description: Code snippet to reproduce this issue using 1.6.0: {code} select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test {code} It will throw exceptions like this: {noformat} Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and decimal(10,0)).; line 1 pos 37 {noformat} I also tested: {code} select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; {code} {noformat} Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' (decimal(10,0) and decimal(19,6)).; line 1 pos 38 {noformat} was: I found a bug: select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test It will throw exceptions like this: Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and decimal(10,0)).; line 1 pos 37 I also test: select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai > > 
Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Target Version/s: 1.6.2 > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Assignee: cen yuhai > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai >Assignee: cen yuhai > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13772. Resolution: Fixed Fix Version/s: 1.6.2 Issue resolved by pull request 11605 [https://github.com/apache/spark/pull/11605] > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 1.6.2 > > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
Cheng Lian created SPARK-3414: - Summary: Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Critical

Paste the following snippet into {{spark-shell}} (needs Hive support) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
  SELECT name, message
  FROM rawLogs
  JOIN (
    SELECT name FROM logFiles
  ) files
  ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")

sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. 
When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} Notice that attributes referenced in the join operator are not lowercased yet. And then, when {{select * from boom}} is being analyzed, the input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
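The ordering problem can be sketched with a toy model in plain Scala — hypothetical names throughout, not real Catalyst classes: an early batch lowercases attribute references, so a plan that is registered before analysis never receives that treatment, and the later resolution step cannot match its mixed-case references.

```scala
// Toy model of the ordering bug (hypothetical names, not real Catalyst
// classes): a plan registered in the catalog *before* analysis keeps
// mixed-case references that the later resolution step cannot match.
case class ToyPlan(references: Seq[String], catalogAttributes: Seq[String]) {
  // Stand-in for the CaseInsensitiveAttributeReferences batch.
  def lowercaseRefs: ToyPlan = copy(references = references.map(_.toLowerCase))
  // Stand-in for the Resolution batch: every reference must match an attribute.
  def resolves: Boolean = references.forall(catalogAttributes.contains)
}

// Attributes exposed by the lowercased output schema.
val attrs = Seq("rawlogs.filename", "files.name")
// Registering the unanalyzed plan keeps the mixed-case "rawLogs" reference.
val registeredUnanalyzed = ToyPlan(Seq("rawLogs.filename", "files.name"), attrs)
```

Running `lowercaseRefs` first (i.e. registering the analyzed plan) lets `resolves` succeed; skipping it, as happens when the unanalyzed plan is stored, makes it fail.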
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} Notice that attributes referenced in the join operator are not lowercased yet. 
And then, when {{select * from boom}} is being analyzed, the input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,messag
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed once. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} Notice that attributes referenced in the join operator are not lowercased yet. 
And then, when {{select * from boom}} is being analyzed, the input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapP
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Summary: Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names (was: Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names) > Case insensitivity breaks when unresolved relation contains attributes with > uppercase letters in their names > > > Key: SPARK-3414 > URL: https://issues.apache.org/jira/browse/SPARK-3414 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Cheng Lian >Priority: Critical > > Paste the following snippet to {{spark-shell}} (need Hive support) to > reproduce this issue: > {code} > import org.apache.spark.sql.hive.HiveContext > val hiveContext = new HiveContext(sc) > import hiveContext._ > case class LogEntry(filename: String, message: String) > case class LogFile(name: String) > sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") > sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") > val srdd = sql( > """ > SELECT name, message > FROM rawLogs > JOIN ( > SELECT name > FROM logFiles > ) files > ON rawLogs.filename = files.name > """) > srdd.registerTempTable("boom") > sql("select * from boom") > {code} > Exception thrown: > {code} > SchemaRDD[7] at RDD at SchemaRDD.scala:103 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved > attributes: *, tree: > Project [*] > LowerCaseSchema > Subquery boom >Project ['name,'message] > Join Inner, Some(('rawLogs.filename = name#2)) > LowerCaseSchema > Subquery rawlogs >SparkLogicalPlan (ExistingRdd [filename#0,message#1], > MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) > Subquery files > Project [name#2] >LowerCaseSchema > Subquery logfiles > SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at > 
mapPartitions at basicOperators.scala:208) > {code} > Notice that {{rawLogs}} in the join operator is not lowercased. > The reason is that, during analysis phase, the > {{CaseInsensitiveAttributeReferences}} batch is only executed before the > {{Resolution}} batch. > When {{srdd}} is registered as temporary table {{boom}}, its original > (unanalyzed) logical plan is stored into the catalog: > {code} > Join Inner, Some(('rawLogs.filename = 'files.name)) > UnresolvedRelation None, rawLogs, None > Subquery files > Project ['name] >UnresolvedRelation None, logFiles, None > {code} > notice that attributes referenced in the join operator (esp. {{rawLogs}}) are > not lowercased yet. > And then, when {{select * from boom}} is being analyzed, its input logical > plan is: > {code} > Project [*] > UnresolvedRelation None, boom, None > {code} > here the unresolved relation points to the unanalyzed logical plan of > {{srdd}}, which is later discovered by rule {{ResolveRelations}}: > {code} > === Applying Rule > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === > Project [*]Project [*] > ! UnresolvedRelation None, boom, NoneLowerCaseSchema > ! Subquery boom > ! Project ['name,'message] > ! Join Inner, > Some(('rawLogs.filename = 'files.name)) > !LowerCaseSchema > ! Subquery rawlogs > ! SparkLogicalPlan (ExistingRdd > [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at > basicOperators.scala:208) > !Subquery files > ! Project ['name] > ! LowerCaseSchema > ! Subquery logfiles > !SparkLogicalPlan > (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:208) > {code} > Because the {{CaseInsensitiveAttributeReferences}} batch happens before the > {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) > is not lowercased, which causes the resolution failure. > A reasonable fix for this could be to always register the analyzed logical plan in > the catalog when registering temporary tables. 
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. When {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. 
{{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which is later discovered by rule {{ResolveRelations}}: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. 
{{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure. A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename =
[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3414: -- Description: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. 
{{rawLogs}}) are not lowercased yet. And then, when {{select * from boom}} is being analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all, and {{rawLogs.filename}} is therefore not lowercased: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. 
was: Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs") sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles") val srdd = sql( """ SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name """) srdd.registerTempTable("boom") sql("select * from boom") {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOper
[jira] [Created] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name
Cheng Lian created SPARK-3421: - Summary: StructField.toString should quote the name field to allow arbitrary character as struct field name Key: SPARK-3421 URL: https://issues.apache.org/jira/browse/SPARK-3421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian The original use case is something like this: {code} // JSON snippet with "illegal" characters in field names val json = """{ "a(b)": { "c(d)": "hello" } }""" :: """{ "a(b)": { "c(d)": "world" } }""" :: Nil val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json)) jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet") java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), [1.37] failure: `,' expected but `(' found {code} The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} as struct field names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
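The shape of the fix suggested by the summary can be sketched in plain Scala: the parser's field-name rule accepts only bare identifiers, so {{StructField.toString}} could quote names, letting the parser additionally accept a quoted form with arbitrary characters. The regexes and helper names below are hypothetical illustrations, not the actual Spark {{DataType}} parser grammar:

```scala
// Sketch of the limitation and a possible fix. Hypothetical regexes and
// helpers; not the real Spark DataType parser.
val bareIdent   = "^[a-zA-Z0-9_]+$".r   // what the parser accepts today
val quotedIdent = "^`[^`]+`$".r         // a possible quoted form

def isParsableFieldName(name: String): Boolean =
  bareIdent.findFirstIn(name).isDefined || quotedIdent.findFirstIn(name).isDefined

// StructField.toString could emit the quoted form for arbitrary names.
def quote(name: String): String = s"`$name`"
```

A name like `a(b)` fails the bare-identifier rule but would round-trip once quoted.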
[jira] [Commented] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123556#comment-14123556 ] Cheng Lian commented on SPARK-2537: --- PR [#1440|https://github.com/apache/spark/pull/1440] fixes this issue. > Workaround Timezone specific Hive tests > --- > > Key: SPARK-2537 > URL: https://issues.apache.org/jira/browse/SPARK-2537 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1, 1.1.0 >Reporter: Cheng Lian >Priority: Minor > > Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: > - {{timestamp_1}} > - {{timestamp_2}} > - {{timestamp_3}} > - {{timestamp_udf}} > Their answers differ between different timezones. Caching golden answers > naively causes build failures in other timezones. Currently these tests are > blacklisted. A not so clever solution is to cache golden answers of all > timezones for these tests, then select the right version for the current > build according to system timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-2537. --- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 > Workaround Timezone specific Hive tests > --- > > Key: SPARK-2537 > URL: https://issues.apache.org/jira/browse/SPARK-2537 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1, 1.1.0 >Reporter: Cheng Lian >Priority: Minor > Fix For: 1.1.0 > > > Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: > - {{timestamp_1}} > - {{timestamp_2}} > - {{timestamp_3}} > - {{timestamp_udf}} > Their answers differ between different timezones. Caching golden answers > naively causes build failures in other timezones. Currently these tests are > blacklisted. A not so clever solution is to cache golden answers of all > timezones for these tests, then select the right version for the current > build according to system timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
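The "cache golden answers of all timezones, then select by system timezone" workaround sketched in the ticket could look like the following — the key/file naming scheme and the UTC fallback are assumptions for illustration, not Spark's actual test-harness layout:

```scala
// Sketch of per-timezone golden-answer selection: pick the cached answer
// matching the build machine's timezone, falling back to a UTC baseline.
// Naming scheme and fallback are assumptions, not Spark's real harness.
import java.util.TimeZone

def goldenAnswerKey(test: String,
                    cached: Set[String],
                    tz: TimeZone = TimeZone.getDefault): String = {
  val specific = s"$test-${tz.getID}"
  if (cached.contains(specific)) specific else s"$test-UTC"
}
```

A build in `America/Los_Angeles` would then compare against `timestamp_1-America/Los_Angeles` when that answer is cached, and against the UTC baseline otherwise.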
[jira] [Created] (SPARK-3440) HiveServer2 and CLI should retrieve Hive result set schema
Cheng Lian created SPARK-3440: - Summary: HiveServer2 and CLI should retrieve Hive result set schema Key: SPARK-3440 URL: https://issues.apache.org/jira/browse/SPARK-3440 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: Cheng Lian When executing Hive native queries/commands with {{HiveContext.runHive}}, Spark SQL only calls {{Driver.getResults}} and returns a {{Seq\[String\]}}. The schema of the result set is not retrieved, making it impossible to split the row strings into proper columns and assign column names to them. For example, currently every {{NativeCommand}} returns only a single column named {{result}}. For existing Hive applications that rely on result set schemas, this breaks compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
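To see why the schema matters, consider what becomes possible once it is retrieved: the raw row strings can at best be split on a delimiter and paired with the retrieved column names. A minimal sketch — the tab delimiter and helper name are assumptions for illustration, not what Hive or Spark actually guarantees:

```scala
// Sketch: pairing Driver.getResults' raw row strings with column names
// retrieved from the result set schema. The tab delimiter and helper are
// illustrative assumptions only.
def toRows(raw: Seq[String], columnNames: Seq[String]): Seq[Map[String, String]] =
  raw.map { line =>
    val cells = line.split("\t", -1) // assumed CLI-style tab-separated output
    columnNames.zip(cells).toMap
  }
```

Without the schema, only the single-column `result` representation is possible; with it, each row becomes properly named columns.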
[jira] [Created] (SPARK-3448) SpecificMutableRow.update doesn't check for null
Cheng Lian created SPARK-3448: - Summary: SpecificMutableRow.update doesn't check for null Key: SPARK-3448 URL: https://issues.apache.org/jira/browse/SPARK-3448 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Priority: Minor Fix For: 1.1.1 {code} test("SpecificMutableRow.update with null") { val row = new SpecificMutableRow(Seq(IntegerType)) row(0) = null assert(row.isNullAt(0)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
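A fix along the lines the failing test above expects might look like this. This is a simplified stand-in class, not Spark's actual {{SpecificMutableRow}}; the point is only that {{update}} must check for null before delegating to a type-specific setter.

```scala
// Simplified sketch of a mutable row whose update method checks for null
// instead of blindly unboxing the value.
class SimpleMutableRow(size: Int) {
  private val values = new Array[Any](size)
  private val nulls  = new Array[Boolean](size)

  def update(ordinal: Int, value: Any): Unit =
    if (value == null) setNullAt(ordinal)            // the missing null check
    else { values(ordinal) = value; nulls(ordinal) = false }

  def setNullAt(ordinal: Int): Unit = {
    values(ordinal) = null
    nulls(ordinal) = true
  }

  def isNullAt(ordinal: Int): Boolean = nulls(ordinal)
}
```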
[jira] [Created] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven
Cheng Lian created SPARK-3515: - Summary: ParquetMetastoreSuite fails when executed together with other suites under Maven Key: SPARK-3515 URL: https://issues.apache.org/jira/browse/SPARK-3515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Reproduction step: {code} mvn -Phive,hadoop-2.4 -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite -pl core,sql/catalyst,sql/core,sql/hive test {code} Maven first instantiates all discovered test suite objects, and only then starts executing the test cases. {{ParquetMetastoreSuite}} sets up several temporary tables in its constructor, but these tables are deleted immediately because {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}. To fix this issue, side effects of this kind should be moved out of the constructor and into {{beforeAll}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven
[ https://issues.apache.org/jira/browse/SPARK-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132406#comment-14132406 ] Cheng Lian commented on SPARK-3515: --- The bug SPARK-3481 fixed actually covered up the bug mentioned in this ticket. > ParquetMetastoreSuite fails when executed together with other suites under > Maven > > > Key: SPARK-3515 > URL: https://issues.apache.org/jira/browse/SPARK-3515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian > > Reproduction step: > {code} > mvn -Phive,hadoop-2.4 > -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite > -pl core,sql/catalyst,sql/core,sql/hive test > {code} > Maven instantiates all discovered test suite object at first, and then starts > executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary > tables in constructor, but these tables are deleted immediately since > {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}. > To fix this issue, we shouldn't put this kind of side effect in constructor, > but in {{beforeAll}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3552) Thrift server doesn't reset current database for each connection
Cheng Lian created SPARK-3552: - Summary: Thrift server doesn't reset current database for each connection Key: SPARK-3552 URL: https://issues.apache.org/jira/browse/SPARK-3552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Reproduction steps: - Start Thrift server - Connect with beeline {code} ./bin/beeline -u jdbc:hive2://localhost:1/default -n lian {code} - Create an empty database and switch to it {code} 0: jdbc:hive2://localhost:1/default> create database test; 0: jdbc:hive2://localhost:1/default> use test; {code} - Exit beeline and reconnect, specifying "default" as the current database {code} ./bin/beeline -u jdbc:hive2://localhost:1/default -n lian {code} - Now {{SHOW TABLES}} returns nothing, indicating that the current database is still {{test}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3609) Add sizeInBytes statistics to Limit operator
Cheng Lian created SPARK-3609: - Summary: Add sizeInBytes statistics to Limit operator Key: SPARK-3609 URL: https://issues.apache.org/jira/browse/SPARK-3609 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated fairly precisely when all output attributes are of native data types, since all native data types except {{StringType}} have fixed sizes. For {{StringType}}, we can use a relatively large default size (say, 4K). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
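The estimation described above amounts to simple arithmetic over the output schema. The sketch below is illustrative, not Spark's actual statistics code: the type hierarchy is reduced to a few stand-in types, the fixed widths are plain JVM primitive sizes, and only the 4K string default comes from the description.

```scala
// Sketch: estimate sizeInBytes of LIMIT n as n * (sum of per-column sizes).
// Fixed-size native types contribute their primitive widths; StringType
// falls back to a large default (4K) because string lengths are unknown.
sealed trait NativeType
case object IntegerType extends NativeType
case object LongType    extends NativeType
case object DoubleType  extends NativeType
case object StringType  extends NativeType

val defaultStringSize: Int = 4 * 1024

def columnSize(t: NativeType): Int = t match {
  case IntegerType           => 4
  case LongType | DoubleType => 8
  case StringType            => defaultStringSize
}

def limitSizeInBytes(limit: Int, schema: Seq[NativeType]): Long =
  limit.toLong * schema.map(columnSize).sum
```

Because the estimate for string columns is deliberately pessimistic, a LIMIT over string-heavy output still errs toward larger sizes, which is the safe direction for broadcast-join decisions.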
[jira] [Commented] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141510#comment-14141510 ] Cheng Lian commented on SPARK-2271: --- [~pwendell] I can't find a Maven artifact for this. From the Hive JIRA Reynold pointed out, the {{Decimal128}} comes from Microsoft PolyBase, which I think is not open source. > Use Hive's high performance Decimal128 to replace BigDecimal > > > Key: SPARK-2271 > URL: https://issues.apache.org/jira/browse/SPARK-2271 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Lian > > Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
Cheng Lian created SPARK-3654: - Summary: Implement all extended HiveQL statements/commands with a separate parser combinator Key: SPARK-3654 URL: https://issues.apache.org/jira/browse/SPARK-3654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} are currently parsed in a rather hacky way, like this: {code} if (sql.trim.toLowerCase.startsWith("cache table")) { sql.trim.toLowerCase.startsWith("cache table") match { ... } } {code} It would be much better to add an extra parser combinator that parses these syntax extensions first, and then falls back to the normal Hive parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
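The "extension parser first, Hive parser as fallback" idea could look roughly like this. This is a toy sketch, not Spark's actual implementation: the grammar covers only two commands, the result classes are stand-ins for real logical plans, and it assumes {{scala.util.parsing.combinator}} is available (it shipped with the Scala standard library in the versions Spark used at the time).

```scala
import scala.util.parsing.combinator.RegexParsers

// Sketch: a small combinator grammar recognizes the syntax extensions;
// anything it cannot parse falls through to the normal Hive parser.
object ExtendedSqlParser extends RegexParsers {
  case class SetCommand(kv: String)   // stand-ins for real logical plans
  case class CacheTable(name: String)

  private val set: Parser[Any] =
    "(?i)SET".r ~> ".*".r ^^ SetCommand
  private val cache: Parser[Any] =
    "(?i)CACHE\\s+TABLE".r ~> "\\w+".r ^^ CacheTable

  // Try the extension grammar first; on failure, delegate to Hive's parser.
  def parseSql(sql: String, hiveFallback: String => Any): Any =
    parseAll(set | cache, sql) match {
      case Success(plan, _) => plan
      case _                => hiveFallback(sql)
    }
}
```

The key property is that the fallback keeps the extension grammar small: it never needs to understand full HiveQL, only the handful of extra commands.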
[jira] [Created] (SPARK-3713) Use JSON to serialize DataType
Cheng Lian created SPARK-3713: - Summary: Use JSON to serialize DataType Key: SPARK-3713 URL: https://issues.apache.org/jira/browse/SPARK-3713 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Currently we use the compiler-generated {{toString}} method of case classes to serialize {{DataType}} objects, which is fragile and has already introduced bugs (e.g. SPARK-3421). Moreover, we also serialize schemas in this format and write them into generated Parquet metadata. Using JSON would fix all these known and potential issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
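To see why JSON sidesteps the {{toString}} parsing problem, consider a tiny stand-in type tree. This is not Spark's actual {{DataType}} API; the classes below are reduced illustrations, and a real implementation would use a JSON library and escape field names properly. The point is that a field name like {{a(b)}} (which broke the old {{toString}} parser, see SPARK-3421) is unremarkable inside a quoted JSON string.

```scala
// Sketch: serialize a (very reduced) DataType tree to JSON instead of
// relying on case-class toString. Field names land inside quoted JSON
// strings, so characters like '(' and ')' need no special grammar.
sealed trait DataType { def json: String }
case object StringType  extends DataType { def json = "\"string\"" }
case object IntegerType extends DataType { def json = "\"integer\"" }

case class StructField(name: String, dataType: DataType) {
  // Real code would escape quotes/backslashes in `name`.
  def json: String = s"""{"name":"$name","type":${dataType.json}}"""
}

case class StructType(fields: Seq[StructField]) extends DataType {
  def json: String =
    fields.map(_.json).mkString("""{"type":"struct","fields":[""", ",", "]}")
}
```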
[jira] [Updated] (SPARK-3713) Use JSON to serialize DataType
[ https://issues.apache.org/jira/browse/SPARK-3713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3713: -- Issue Type: Improvement (was: Bug) > Use JSON to serialize DataType > -- > > Key: SPARK-3713 > URL: https://issues.apache.org/jira/browse/SPARK-3713 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > Currently we are using compiler generated {{toString}} method for case > classes to serialize {{DataType}} objects, which is dangerous and already > introduced some bugs (e.g. SPARK-3421). Moreover, we also serialize schema in > this format and write into generated Parquet metadata. Using JSON can fix all > these known and potential issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
Cheng Lian created SPARK-3738: - Summary: InsertIntoHiveTable can't handle strings with "\n" Key: SPARK-3738 URL: https://issues.apache.org/jira/browse/SPARK-3738 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Priority: Blocker Try the following snippet in {{sbt/sbt hive/console}} to reproduce: {code} sql("drop table if exists z") case class Str(s: String) sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z") table("z").count() {code} Expected result should be 1, but 2 is returned instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
[ https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152765#comment-14152765 ] Cheng Lian commented on SPARK-3738: --- False alarm... it's because Hive's default SerDe uses '\n' as the record delimiter. > InsertIntoHiveTable can't handle strings with "\n" > -- > > Key: SPARK-3738 > URL: https://issues.apache.org/jira/browse/SPARK-3738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Blocker > > Try the following snippet in {{sbt/sbt hive/console}} to reproduce: > {code} > sql("drop table if exists z") > case class Str(s: String) > sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z") > table("z").count() > {code} > Expected result should be 1, but 2 is returned instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
[ https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-3738. - Resolution: Invalid False alarm, it's because of Hive's default SerDe, which uses '\n' as record delimiter. > InsertIntoHiveTable can't handle strings with "\n" > -- > > Key: SPARK-3738 > URL: https://issues.apache.org/jira/browse/SPARK-3738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Blocker > > Try the following snippet in {{sbt/sbt hive/console}} to reproduce: > {code} > sql("drop table if exists z") > case class Str(s: String) > sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z") > table("z").count() > {code} > Expected result should be 1, but 2 is returned instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3791) HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call
Cheng Lian created SPARK-3791: - Summary: HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call Key: SPARK-3791 URL: https://issues.apache.org/jira/browse/SPARK-3791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian The "DBMS Server version" should be Spark version rather than Hive version: {code} ... {"ts":"2014-10-03T07:01:21.679","pid":23188,"tid":"2034","sev":"info","req":"-","sess":"-","site":"-","user":"-","k":"msg","v":"GenericODBCProtocol: DBMS Server version: 0.12.0"} ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3791) HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call
[ https://issues.apache.org/jira/browse/SPARK-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3791: -- Target Version/s: 1.1.1 (was: 1.2.0) > HiveThriftServer2 returns 0.12.0 to ODBC SQLGetInfo call > > > Key: SPARK-3791 > URL: https://issues.apache.org/jira/browse/SPARK-3791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > The "DBMS Server version" should be Spark version rather than Hive version: > {code} > ... > {"ts":"2014-10-03T07:01:21.679","pid":23188,"tid":"2034","sev":"info","req":"-","sess":"-","site":"-","user":"-","k":"msg","v":"GenericODBCProtocol: > DBMS Server version: 0.12.0"} > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3810) Rule PreInsertionCasts doesn't handle partitioned table properly
Cheng Lian created SPARK-3810: - Summary: Rule PreInsertionCasts doesn't handle partitioned table properly Key: SPARK-3810 URL: https://issues.apache.org/jira/browse/SPARK-3810 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Priority: Minor This issue can be reproduced by the following {{sbt/sbt hive/console}} session: {code} scala> loadTestTable("src") ... scala> loadTestTable("srcpart") ... scala> sql("INSERT INTO TABLE srcpart PARTITION (ds='1', hr='2') SELECT key, value FROM src").queryExecution ... == Parsed Logical Plan == InsertIntoTable (UnresolvedRelation None, srcpart, None), Map(ds -> Some(hello), hr -> Some(world)), false Project ['key,'value] UnresolvedRelation None, src, None == Analyzed Logical Plan == InsertIntoTable (MetastoreRelation default, srcpart, None), Map(ds -> Some(hello), hr -> Some(world)), false Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key#50,value#51] Project [key... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3810) Rule PreInsertionCasts doesn't handle partitioned table properly
[ https://issues.apache.org/jira/browse/SPARK-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160151#comment-14160151 ] Cheng Lian commented on SPARK-3810: --- This issue is marked as MINOR because it doesn't affect correctness. All the redundant {{Project}}s can be removed by the subsequent optimization phase. > Rule PreInsertionCasts doesn't handle partitioned table properly > > > Key: SPARK-3810 > URL: https://issues.apache.org/jira/browse/SPARK-3810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Minor > > This issue can be reproduced by the following {{sbt/sbt hive/console}} > session: > {code} > scala> loadTestTable("src") > ... > scala> loadTestTable("srcpart") > ... > scala> sql("INSERT INTO TABLE srcpart PARTITION (ds='1', hr='2') SELECT key, > value FROM src").queryExecution > ... > == Parsed Logical Plan == > InsertIntoTable (UnresolvedRelation None, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project ['key,'value] > UnresolvedRelation None, src, None > == Analyzed Logical Plan == > InsertIntoTable (MetastoreRelation default, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3810) Rule PreInsertionCasts doesn't handle partitioned table properly
[ https://issues.apache.org/jira/browse/SPARK-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160151#comment-14160151 ] Cheng Lian edited comment on SPARK-3810 at 10/6/14 10:26 AM: - This issue is marked as MINOR because it doesn't affect correctness. All the redundant Projects can be removed by the subsequent optimization phase. was (Author: lian cheng): This issue is marked as MINOR because it doesn't affect correctness. All the redundant {{Project}}s can be removed by the subsequent optimization phase. > Rule PreInsertionCasts doesn't handle partitioned table properly > > > Key: SPARK-3810 > URL: https://issues.apache.org/jira/browse/SPARK-3810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Priority: Minor > > This issue can be reproduced by the following {{sbt/sbt hive/console}} > session: > {code} > scala> loadTestTable("src") > ... > scala> loadTestTable("srcpart") > ... > scala> sql("INSERT INTO TABLE srcpart PARTITION (ds='1', hr='2') SELECT key, > value FROM src").queryExecution > ... > == Parsed Logical Plan == > InsertIntoTable (UnresolvedRelation None, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project ['key,'value] > UnresolvedRelation None, src, None > == Analyzed Logical Plan == > InsertIntoTable (MetastoreRelation default, srcpart, None), Map(ds -> > Some(hello), hr -> Some(world)), false > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] > Project [key#50,value#51] >Project [key#50,value#51] > Project [key... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name
[ https://issues.apache.org/jira/browse/SPARK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-3421. --- Resolution: Fixed > StructField.toString should quote the name field to allow arbitrary character > as struct field name > -- > > Key: SPARK-3421 > URL: https://issues.apache.org/jira/browse/SPARK-3421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The original use case is something like this: > {code} > // JSON snippet with "illegal" characters in field names > val json = > """{ "a(b)": { "c(d)": "hello" } }""" :: > """{ "a(b)": { "c(d)": "world" } }""" :: > Nil > val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json)) > jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet") > java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: > StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), > [1.37] failure: `,' expected but `(' found > {code} > The reason is that, the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} > as struct field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166812#comment-14166812 ] Cheng Lian commented on SPARK-3892: --- Actually {{MapType.simpleName}} can simply be removed; it's not used anywhere. I forgot to remove it while refactoring. {{DataType.typeName}} is defined as: {code} def typeName: String = this.getClass.getSimpleName.stripSuffix("$").dropRight(4).toLowerCase {code} So concrete {{DataType}} classes don't need to override {{typeName}} as long as their names end with {{Type}}. > Map type should have typeName > - > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
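The derivation in the quoted definition above can be checked with plain strings. Here bare class-name strings stand in for calling {{getClass.getSimpleName}} on the actual {{DataType}} subclasses; note that Scala {{object}}s get a trailing {{$}} in their class names, which is why the definition strips it first.

```scala
// typeName strips a trailing "$" (Scala object class names end in '$'),
// drops the 4-character "Type" suffix, then lowercases:
// "IntegerType$" -> "integer", "MapType" -> "map".
def typeNameOf(simpleClassName: String): String =
  simpleClassName.stripSuffix("$").dropRight(4).toLowerCase
```

This also shows the constraint stated in the comment: the derivation only works for classes whose names end in {{Type}}, since exactly four characters are dropped.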
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166822#comment-14166822 ] Cheng Lian commented on SPARK-3892: --- [~adrian-wang] You're right, it's a typo. So would you mind changing the priority of this ticket to Minor? > Map type should have typeName > - > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166848#comment-14166848 ] Cheng Lian commented on SPARK-3892: --- Ah, while working on the {{DataType}} JSON ser/de PR ([#2563|https://github.com/apache/spark/pull/2563]), I had refactored {{simpleString}} to {{simpleName}}, eventually arrived at the current version, and removed all overrides from subclasses. {{MapType.simpleName}} was not removed partly because it's a member of {{object MapType}}, which is not a subclass of {{DataType}}. Sorry for the trouble and confusion. > Map type do not need simpleName > --- > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166886#comment-14166886 ] Cheng Lian commented on SPARK-3892: --- Please see my comments in the PR :) > Map type do not need simpleName > --- > > Key: SPARK-3892 > URL: https://issues.apache.org/jira/browse/SPARK-3892 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3914) InMemoryRelation should inherit statistics of its child to enable broadcast join
Cheng Lian created SPARK-3914: - Summary: InMemoryRelation should inherit statistics of its child to enable broadcast join Key: SPARK-3914 URL: https://issues.apache.org/jira/browse/SPARK-3914 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian When a table/query is cached, {{InMemoryRelation}} stores the physical plan rather than the logical plan of the original table/query, thus loses the statistics information and disables broadcast join optimization. Sample {{spark-shell}} session to reproduce this issue: {code} val sparkContext = sc import org.apache.spark.sql._ import sparkContext._ val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class Sale(year: Int) makeRDD((1 to 100).map(Sale(_))).registerTempTable("sales") sql("select distinct year from sales limit 10").registerTempTable("tinyTable") cacheTable("tinyTable") sql("select * from sales join tinyTable on sales.year = tinyTable.year").queryExecution.executedPlan ... res3: org.apache.spark.sql.execution.SparkPlan = Project [year#4,year#5] ShuffledHashJoin [year#4], [year#5], BuildRight Exchange (HashPartitioning [year#4], 200) PhysicalRDD [year#4], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Exchange (HashPartitioning [year#5], 200) InMemoryColumnarTableScan [year#5], [], (InMemoryRelation [year#5], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} A workaround for this is to add a {{LIMIT}} operator above the {{InMemoryColumnarTableScan}} operator: {code} sql("select * from sales join (select * from tinyTable limit 10) tiny on sales.year = tiny.year").queryExecution.executedPlan ... 
res8: org.apache.spark.sql.execution.SparkPlan = Project [year#12,year#13] BroadcastHashJoin [year#12], [year#13], BuildRight PhysicalRDD [year#12], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Limit 10 InMemoryColumnarTableScan [year#13], [], (InMemoryRelation [year#13], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure
Cheng Lian created SPARK-3919: - Summary: HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure Key: SPARK-3919 URL: https://issues.apache.org/jira/browse/SPARK-3919 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian When using MySQL backed Metastore with {{hive.metastore.schema.verification}} set to {{true}}, HiveThriftServer2 fails to start: {code} 14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2 org.apache.hive.service.ServiceException: Failed to Start HiveServer2 at org.apache.hive.service.CompositeService.start(CompositeService.java:80) at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hive.service.ServiceException: Unable to connect to MetaStore! at org.apache.hive.service.cli.CLIService.start(CLIService.java:85) at org.apache.hive.service.CompositeService.start(CompositeService.java:70) ... 
10 more Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does not match metastore's schema version 0.12.0 Metastore is not upgraded or corrupt) at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651) at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124) at com.sun.proxy.$Proxy11.verifySchema(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:286) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:54) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:121) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:104) at org.apache.hive.service.cli.CLIService.start(CLIService.java:82) ... 11 more {code} Seems that recent Akka/Protobuf dependency changes are related to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure
[ https://issues.apache.org/jira/browse/SPARK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3919: -- Description: When using MySQL backed Metastore with {{hive.metastore.schema.verification}} set to {{true}}, HiveThriftServer2 fails to start: {code} 14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2 org.apache.hive.service.ServiceException: Failed to Start HiveServer2 at org.apache.hive.service.CompositeService.start(CompositeService.java:80) at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hive.service.ServiceException: Unable to connect to MetaStore! at org.apache.hive.service.cli.CLIService.start(CLIService.java:85) at org.apache.hive.service.CompositeService.start(CompositeService.java:70) ... 
10 more Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does not match metastore's schema version 0.12.0 Metastore is not upgraded or corrupt) at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651) at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124) at com.sun.proxy.$Proxy11.verifySchema(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:286) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:54) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:121) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:104) at org.apache.hive.service.cli.CLIService.start(CLIService.java:82) ... 11 more {code} Seems that recent Akka/Protobuf dependency changes are related to this. A valid workaround is to set {{hive.metastore.schema.verification}} to {{false}}. 
[jira] [Commented] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure
[ https://issues.apache.org/jira/browse/SPARK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168664#comment-14168664 ] Cheng Lian commented on SPARK-3919: --- [~pwendell] Hive Metastore schema verification requires the version string to be exactly the same (except the {{-SNAPSHOT}} suffix). > HiveThriftServer2 fails to start because of Hive 0.12 metastore schema > verification failure > --- > > Key: SPARK-3919 > URL: https://issues.apache.org/jira/browse/SPARK-3919 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > When using MySQL backed Metastore with {{hive.metastore.schema.verification}} > set to {{true}}, HiveThriftServer2 fails to start: > {code} > 14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2 > org.apache.hive.service.ServiceException: Failed to Start HiveServer2 > at > org.apache.hive.service.CompositeService.start(CompositeService.java:80) > at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.hive.service.ServiceException: Unable to connect to > MetaStore! 
> at org.apache.hive.service.cli.CLIService.start(CLIService.java:85) > at > org.apache.hive.service.CompositeService.start(CompositeService.java:70) > ... 10 more > Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does > not match metastore's schema version 0.12.0 Metastore is not upgraded or > corrupt) > at > org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651) > at > org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124) > at com.sun.proxy.$Proxy11.verifySchema(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:286) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:54) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) > at > org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:121) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:104) > at org.apache.hive.service.cli.CLIService.start(CLIService.java:82) > ... 11 more > {code} > Seems that recent Akka/Protobuf dependency changes are related to this. 
> A valid workaround is to set {{hive.metastore.schema.verification}} to > {{false}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
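For reference, the workaround above corresponds to a {{hive-site.xml}} fragment along these lines (a minimal illustrative sketch; the exact file location depends on the deployment, and the property can usually also be passed on the command line):

```xml
<!-- Hypothetical hive-site.xml fragment: disables Metastore schema
     verification as a workaround for the version-string mismatch.
     Placement and surrounding configuration depend on your setup. -->
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
```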
[jira] [Commented] (SPARK-3938) Set RDD name to table name during cache operations
[ https://issues.apache.org/jira/browse/SPARK-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170454#comment-14170454 ] Cheng Lian commented on SPARK-3938: --- A problem here is that after PR [#2501|https://github.com/apache/spark/pull/2501], cached tables may share in-memory columnar RDDs, for example: {code} sql("CREATE TABLE src(key INT, value STRING)") val aRDD = sql("CACHE TABLE a AS SELECT * FROM src") val bRDD = sql("CACHE TABLE b AS SELECT key, value FROM src") {code} The two tables {{a}} and {{b}} share the same underlying in-memory columnar RDD because their queries produce the same results. This can be easily verified from the Web UI. Furthermore, setting names on the resulting {{SchemaRDD}}s ({{aRDD}} and {{bRDD}}) is useless, since the RDD that is actually cached is the underlying in-memory columnar RDD compiled from the logical plan. I think the best we can do is to list all in-memory table names in the in-memory columnar RDD name. On the other hand, the current in-memory columnar RDD name is the string representation of the physical plan, which can be more useful for debugging. > Set RDD name to table name during cache operations > -- > > Key: SPARK-3938 > URL: https://issues.apache.org/jira/browse/SPARK-3938 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Patrick Wendell >Assignee: Cheng Lian > > When we create a table via "CACHE TABLE tbl" or "CACHE TABLE tbl AS SELECT", > we should name the created RDD with the table name. This will allow it to > render nicely in the storage tab, which is necessary when people look at the > storage tab to understand the caching behavior of Spark (e.g. percentage in > cache, etc).
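The sharing behavior described in the comment can be illustrated with a toy model (a hypothetical Python sketch, not Spark code): caching is keyed by the normalized query plan rather than by table name, so two table names can map to a single cached entry, and the best a naming scheme can do is list every table name on the shared entry.

```python
# Toy model of a cache keyed by the (normalized) query plan rather than
# by table name. "SELECT *" and an explicit full column list over the
# same table normalize to the same plan, so both names end up sharing
# one cached entry.
class PlanKeyedCache:
    def __init__(self):
        self._entries = {}  # normalized plan -> cached object
        self._names = {}    # normalized plan -> table names using it

    def cache_table(self, name, plan):
        # Assume `plan` is already a normalized string representation.
        entry = self._entries.setdefault(plan, {"rdd_name": None})
        self._names.setdefault(plan, set()).add(name)
        # Best effort: list every table name in the shared entry's name.
        entry["rdd_name"] = ", ".join(sorted(self._names[plan]))
        return entry

cache = PlanKeyedCache()
a = cache.cache_table("a", "Scan src [key, value]")
b = cache.cache_table("b", "Scan src [key, value]")
assert a is b                 # one shared in-memory entry, not two
print(a["rdd_name"])          # -> "a, b"
```

Renaming the per-table handle (`aRDD` / `bRDD` in the comment) never touches the shared entry, which is why the naming has to happen on the shared object itself.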
[jira] [Created] (SPARK-4000) Gathers unit tests logs to Jenkins master at the end of a Jenkins build
Cheng Lian created SPARK-4000: - Summary: Gathers unit tests logs to Jenkins master at the end of a Jenkins build Key: SPARK-4000 URL: https://issues.apache.org/jira/browse/SPARK-4000 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.1.0 Reporter: Cheng Lian
[jira] [Updated] (SPARK-4000) Gathers unit tests logs to Jenkins master at the end of a Jenkins build
[ https://issues.apache.org/jira/browse/SPARK-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4000: -- Description: Unit test logs can be useful for debugging Jenkins failures. Currently these logs are deleted together with the build directory. We can scp the archived logs to the build history directory on the Jenkins master, and then serve them via HTTP. > Gathers unit tests logs to Jenkins master at the end of a Jenkins build > --- > > Key: SPARK-4000 > URL: https://issues.apache.org/jira/browse/SPARK-4000 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > Unit test logs can be useful for debugging Jenkins failures. Currently these > logs are deleted together with the build directory. We can scp the archived > logs to the build history directory on the Jenkins master, and then serve them > via HTTP.
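The proposed log-gathering step could look roughly like the sketch below (all paths are illustrative placeholders; a real Jenkins job would scp the archive to the master rather than copy it locally):

```shell
# Sketch: collect unit test logs into one archive before the build
# directory is deleted. Directory names here are stand-ins only.
set -e
build_dir=$(mktemp -d)     # stands in for the Jenkins build workspace
history_dir=$(mktemp -d)   # stands in for the master's build-history dir

# Simulate a module that produced a unit test log during the build.
mkdir -p "$build_dir/core/target"
echo "test output" > "$build_dir/core/target/unit-tests.log"

# Archive every unit-tests.log found under the build directory...
tar -czf "$history_dir/unit-test-logs.tgz" -C "$build_dir" \
    $(cd "$build_dir" && find . -name 'unit-tests.log')

# ...then a real job would ship the archive to the Jenkins master, e.g.:
#   scp "$history_dir/unit-test-logs.tgz" jenkins-master:/path/to/history/
# after which the master can serve it over HTTP.
ls "$history_dir"
```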
[jira] [Commented] (SPARK-4037) NPE in JDBC server when calling SET
[ https://issues.apache.org/jira/browse/SPARK-4037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179305#comment-14179305 ] Cheng Lian commented on SPARK-4037: --- This is a regression of SPARK-2814, introduced in SPARK-3729. One solution is to let HiveContext always reuse the SessionState started within the same thread, and create a new one only if there is none. In this way, we don't need to override the SessionState field in HiveThriftServer2, thus eliminating this issue. > NPE in JDBC server when calling SET > --- > > Key: SPARK-4037 > URL: https://issues.apache.org/jira/browse/SPARK-4037 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > > {code} > SET spark.sql.shuffle.partitions=10; > {code} > {code} > 14/10/21 18:00:47 ERROR server.SparkSQLOperationManager: Error executing > query: > java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:244) > at > org.apache.spark.sql.execution.SetCommand.sideEffectResult$lzycompute(commands.scala:64) > ... > {code}
[jira] [Commented] (SPARK-4037) NPE in JDBC server when calling SET
[ https://issues.apache.org/jira/browse/SPARK-4037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179598#comment-14179598 ] Cheng Lian commented on SPARK-4037: --- I think we can safely remove the global singleton SessionState created in HiveThriftServer2, and replace it with the SessionState field instance in HiveContext. The current global singleton SessionState design dates back to Shark (SharkContext.sessionState). It basically breaks session isolation in Shark, and later in HiveThriftServer2, because all connections/sessions share a single SessionState instance. For example, switching the current database in one connection also affects other concurrent connections (SPARK-3552). Fixing this properly requires major refactoring of HiveContext. Considering that the 1.2.0 release is approaching, and Hive support will probably be rewritten against the newly introduced external data source API, I'd like to fix this specific SessionState initialization issue for the Spark 1.2.0 release and leave multi-user support to a later release. > NPE in JDBC server when calling SET > --- > > Key: SPARK-4037 > URL: https://issues.apache.org/jira/browse/SPARK-4037 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > > {code} > SET spark.sql.shuffle.partitions=10; > {code} > {code} > 14/10/21 18:00:47 ERROR server.SparkSQLOperationManager: Error executing > query: > java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:244) > at > org.apache.spark.sql.execution.SetCommand.sideEffectResult$lzycompute(commands.scala:64) > ... 
> {code}
[jira] [Closed] (SPARK-3939) NPE caused by SessionState.out not set in thriftserver2
[ https://issues.apache.org/jira/browse/SPARK-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-3939. - Resolution: Duplicate > NPE caused by SessionState.out not set in thriftserver2 > --- > > Key: SPARK-3939 > URL: https://issues.apache.org/jira/browse/SPARK-3939 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang > > a simple 'set' query can reproduce this in thriftserver.
[jira] [Commented] (SPARK-3939) NPE caused by SessionState.out not set in thriftserver2
[ https://issues.apache.org/jira/browse/SPARK-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179749#comment-14179749 ] Cheng Lian commented on SPARK-3939: --- Ah, actually it's SPARK-4037 that duplicates this ticket, since this one came earlier. I only realized this after closing the ticket; sorry for the confusion... > NPE caused by SessionState.out not set in thriftserver2 > --- > > Key: SPARK-3939 > URL: https://issues.apache.org/jira/browse/SPARK-3939 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang > > a simple 'set' query can reproduce this in thriftserver.
[jira] [Commented] (SPARK-4021) Issues observed after upgrading Jenkins to JDK7u71
[ https://issues.apache.org/jira/browse/SPARK-4021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181199#comment-14181199 ] Cheng Lian commented on SPARK-4021: --- Hi [~shaneknapp], I think this shows some clue, notice the GCJ 1.5.0.0 {{javac}}: {code} [lian@amp-jenkins-slave-05 ~]$ locate javac /etc/alternatives/javac /usr/bin/javac /usr/java/jdk1.7.0_71/bin/javac /usr/java/jdk1.7.0_71/man/ja_JP.UTF-8/man1/javac.1 /usr/java/jdk1.7.0_71/man/man1/javac.1 /usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/bin/javac /usr/lib64/R/etc/javaconf /usr/share/vim/vim72/compiler/javac.vim /usr/share/vim/vim72/syntax/javacc.vim /var/lib/alternatives/javac [lian@amp-jenkins-slave-05 ~]$ ls -alh /etc/alternatives/javac lrwxrwxrwx. 1 root root 37 Sep 29 17:17 /etc/alternatives/javac -> /usr/lib/jvm/java-1.5.0-gcj/bin/javac [lian@amp-jenkins-slave-05 ~]$ {code} > Issues observed after upgrading Jenkins to JDK7u71 > -- > > Key: SPARK-4021 > URL: https://issues.apache.org/jira/browse/SPARK-4021 > Project: Spark > Issue Type: Bug > Components: Project Infra > Environment: JDK 7u71 >Reporter: Patrick Wendell >Assignee: shane knapp > > The following compile failure was observed after adding JDK7u71 to Jenkins. > However, this is likely a misconfiguration from Jenkins rather than an issue > with Spark (these errors are specific to JDK5, in fact). > {code} > [error] -- > [error] 1. WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 83) > [error] private static final Logger logger = > Logger.getLogger(JavaKinesisWordCountASL.class); > [error] ^^ > [error] The field JavaKinesisWordCountASL.logger is never read locally > [error] -- > [error] 2. 
WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 151) > [error] JavaDStream words = unionStreams.flatMap(new > FlatMapFunction() { > [error] > ^ > [error] The serializable class does not declare a static final > serialVersionUID field of type long > [error] -- > [error] 3. ERROR in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 153) > [error] public Iterable call(byte[] line) { > [error] ^ > [error] The method call(byte[]) of type new > FlatMapFunction(){} must override a superclass method > [error] -- > [error] 4. WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 160) > [error] new PairFunction() { > [error] ^^^ > [error] The serializable class does not declare a static final > serialVersionUID field of type long > [error] -- > [error] 5. ERROR in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 162) > [error] public Tuple2 call(String s) { > [error] ^^ > [error] The method call(String) of type new > PairFunction(){} must override a superclass method > [error] -- > [error] 6. 
WARNING in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 165) > [error] }).reduceByKey(new Function2() { > [error] ^^ > [error] The serializable class does not declare a static final > serialVersionUID field of type long > [error] -- > [error] 7. ERROR in > /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java > (at line 167) > [error] public Integer call(Integer i1, Integer i2) { > [error] > [error] The method call(Integer, Integer) of type new > Function2(){} must override a superclass method > [error] -- > [error] 7 prob
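On the misconfigured {{javac}} noted in the comment above: a generic way to see which real binary an alternatives-managed command resolves to is to chase the symlink chain (a sketch; demonstrated here with {{sh}}, since {{javac}} may not be installed, but on the affected box `resolve javac` would have revealed the GCJ binary):

```shell
# Resolve a command on PATH through any symlink chain (such as
# /etc/alternatives) to its real path.
resolve() {
    readlink -f "$(command -v "$1")"
}

# `sh` exists everywhere; prints the real path of the sh binary.
resolve sh
```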
[jira] [Created] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
Cheng Lian created SPARK-4091: - Summary: Occasionally spark.local.dir can be deleted twice and causes test failure Key: SPARK-4091 URL: https://issues.apache.org/jira/browse/SPARK-4091 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Cheng Lian By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark may occasionally throw the following exception when shutting down: {code} java.io.IOException: Failed to list files for dir: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) {code} By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather than suspend execution, we can get the following 
result, which shows {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and the shutdown hook installed in {{Utils}}: {code} +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) scala.collection.mutable.HashSet.foreach(HashSet.scala:79) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:147) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.util.Utils$.logUncaughtEx
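The failure mode above (two code paths, {{DiskBlockManager.stop}} and the shutdown hook, racing to delete the same directory) can be avoided by making the recursive delete tolerant of an already-removed tree. A minimal Python sketch of that idea (not the actual Spark fix):

```python
import os
import shutil
import tempfile

def delete_recursively(path):
    # Idempotent recursive delete: silently succeed if another code
    # path (e.g. a shutdown hook) already removed the directory,
    # instead of failing while trying to list its files.
    shutil.rmtree(path, ignore_errors=True)

d = tempfile.mkdtemp()
open(os.path.join(d, "block"), "w").close()

delete_recursively(d)   # first caller, e.g. DiskBlockManager.stop
delete_recursively(d)   # second caller, e.g. the shutdown hook: no error
assert not os.path.exists(d)
```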
[jira] [Updated] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4091: -- Description: By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark may occasionally throw the following exception when shutting down: {code} java.io.IOException: Failed to list files for dir: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) {code} By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather than suspend execution, we can get the following result, which shows {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and the shutdown hook 
installed in {{Utils}}: {code} +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) scala.collection.mutable.HashSet.foreach(HashSet.scala:79) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:147) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:145)] {code} When this bug happens during Jenkins build, it fails {{CliSuite}}. was: By p
[jira] [Commented] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184741#comment-14184741 ] Cheng Lian commented on SPARK-4091: --- Yes, thanks [~joshrosen], closing this. > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at 
org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) >
[jira] [Closed] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-4091. - Resolution: Duplicate > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, 
setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > 
org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) > > org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.appl
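The two stack traces above show the same `spark.local.dir` subdirectory being deleted once from {{DiskBlockManager.stop}} and once from the shutdown hook installed in {{Utils}}; the second attempt fails because listing a directory that no longer exists returns null. A minimal Python sketch (not Spark's actual code) of an idempotent recursive delete that tolerates this race:

```python
import os
import shutil
import tempfile


def delete_recursively(path):
    """Idempotent recursive delete: tolerates the directory having
    already been removed by another cleanup path (e.g. a shutdown
    hook racing with an explicit stop()), instead of failing the
    way a null listFiles() result does on the second attempt."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # Already deleted elsewhere -- treat as success.
        pass


# Simulate the double deletion described in the stack traces above:
local_dir = tempfile.mkdtemp(prefix="spark-local-")
delete_recursively(local_dir)  # first caller, e.g. DiskBlockManager.stop
delete_recursively(local_dir)  # second caller, e.g. the Utils shutdown hook
assert not os.path.exists(local_dir)
```

Spark's eventual fix instead avoided registering the same directory for cleanup twice, but either approach makes the second delete harmless.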
[jira] [Created] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files
Cheng Lian created SPARK-4119: - Summary: Don't rely on HIVE_DEV_HOME to find .q files Key: SPARK-4119 URL: https://issues.apache.org/jira/browse/SPARK-4119 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Priority: Minor After merging in Hive 0.13.1 support, a bunch of .q files and golden answer files got updated. Unfortunately, some .q files were also updated in Hive itself. For example, an ORDER BY clause was added to groupby1_limit.q as a bug fix. With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with false test failures, because .q files are looked up from HIVE_DEV_HOME and outdated .q files are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
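The failure mode above is a stale local checkout shadowing the .q files the tests were written against. One way to avoid it is to resolve each .q file from the project's bundled copy first and fall back to HIVE_DEV_HOME only when no bundled copy exists. A hedged sketch (the `resolve_q_file` helper and the directory layout are hypothetical, for illustration only):

```python
import os


def resolve_q_file(name, bundled_dir, hive_dev_home=None):
    """Resolve a Hive test .q file, preferring the copy bundled with
    the project over one found under HIVE_DEV_HOME, so a checkout of
    a different Hive version cannot shadow the expected file."""
    bundled = os.path.join(bundled_dir, name)
    if os.path.isfile(bundled):
        return bundled
    if hive_dev_home:
        # Assumed query-file location under a Hive source tree.
        candidate = os.path.join(
            hive_dev_home, "ql", "src", "test", "queries",
            "clientpositive", name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(name)
```

With this lookup order, groupby1_limit.q always resolves to the version matching the bundled golden answer files, regardless of which Hive branch HIVE_DEV_HOME points at.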
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187757#comment-14187757 ] Cheng Lian commented on SPARK-3683: --- [~davies] I removed this special case for "NULL" because with it we have no way to represent the literal string {{"NULL"}}. Maybe I was wrong here; I'm validating Hive's behavior in this case. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None; instead it keeps the string 'NULL'. > It's only an issue with the String type; it works with other types.
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187768#comment-14187768 ] Cheng Lian commented on SPARK-3683: --- Actually, a Hive session was illustrated in SPARK-1959, and it seems that Hive interprets {{"NULL"}} as a literal string whose content is "NULL" rather than a null value. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None; instead it keeps the string 'NULL'. > It's only an issue with the String type; it works with other types.
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188194#comment-14188194 ] Cheng Lian commented on SPARK-3683: --- [~jamborta] Your concern is legitimate. Unfortunately, however, we have to take Hive compatibility into consideration in this case; otherwise people who run legacy Hive scripts with Spark SQL may get wrong query results. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None; instead it keeps the string 'NULL'. > It's only an issue with the String type; it works with other types.
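The thread above turns on one ambiguity: if the driver maps the string "NULL" coming back from Hive to Python None, a genuine literal string "NULL" becomes unrepresentable; if it does not, real SQL NULLs must be distinguished some other way. A hypothetical converter (not PySpark's actual code) makes the trade-off concrete:

```python
def convert_hive_value(value, treat_null_string_as_none=False):
    """Hypothetical converter illustrating the trade-off discussed
    above. With the special case enabled, a literal string "NULL"
    becomes indistinguishable from a SQL NULL; per the SPARK-1959
    discussion, Hive itself keeps "NULL" as ordinary text."""
    if treat_null_string_as_none and value == "NULL":
        return None
    return value


# The special case collapses two distinct values into one:
assert convert_hive_value("NULL", treat_null_string_as_none=True) is None
assert convert_hive_value("NULL") == "NULL"  # Hive-compatible behavior
assert convert_hive_value(None) is None      # real SQL NULLs still map to None
```

This is why the conversion has to happen at the Hive boundary, where a real NULL and the four-character string are still distinguishable, rather than by pattern-matching string values afterwards.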