[jira] [Updated] (SPARK-45383) Missing case for RelationTimeTravel in CheckAnalysis

2023-09-29 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-45383:
-
Description: 
{{CheckAnalysis.checkAnalysis0}} lacks a case for {{{}RelationTimeTravel{}}}, 
and since the latter is (intentionally) an {{UnresolvedLeafNode}} rather than a 
{{{}UnaryNode{}}}, the existing checks do not traverse it.

Result: Attempting time travel over a non-existent table produces a Spark 
internal error from the [default 
case|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L818],
 rather than the expected {{{}AnalysisException{}}}:
{code:java}
[info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the 
unresolved operator: 'RelationTimeTravel 'UnresolvedRelation [not_exists], [], 
false, 0
[info]   at 
org.apache.spark.SparkException$.internalError(SparkException.scala:77)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$54(CheckAnalysis.scala:753)
 {code}
Fix should be simple enough:
{code:java}
case tt: RelationTimeTravel =>
  checkAnalysis0(tt.table) {code}
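
For reference, a minimal repro sketch; with the fix in place this should fail analysis with an {{AnalysisException}} (table not found) rather than the internal error above:
{code:java}
// Repro sketch: time travel over a table that does not exist.
// Expected after the fix: AnalysisException from CheckAnalysis, not [INTERNAL_ERROR].
spark.sql("SELECT * FROM not_exists VERSION AS OF 0").collect()
{code}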

  was:
{{CheckAnalysis.checkAnalysis0}} lacks a case for {{{}RelationTimeTravel{}}}, 
and since the latter is (intentionally) an {{UnresolvedLeafNode}} rather than a 
{{{}UnaryNode{}}}, the existing checks do not traverse it.

Result: Attempting time travel over a non-existent table produces a Spark 
internal error from the [default 
case|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L818],
 rather than the expected {{{}AnalysisException{}}}:
{code:java}
[info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the 
unresolved operator: 'RelationTimeTravel 'UnresolvedRelation [not_exists], [], 
false, 0
[info]   at 
org.apache.spark.SparkException$.internalError(SparkException.scala:77)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$54(CheckAnalysis.scala:753)
 {code}
Solution should be simple enough:
{code:java}
case tt: RelationTimeTravel =>
  checkAnalysis0(tt.table) {code}


> Missing case for RelationTimeTravel in CheckAnalysis
> 
>
> Key: SPARK-45383
> URL: https://issues.apache.org/jira/browse/SPARK-45383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Priority: Major
>
> {{CheckAnalysis.checkAnalysis0}} lacks a case for {{{}RelationTimeTravel{}}}, 
> and since the latter is (intentionally) an {{UnresolvedLeafNode}} rather than 
> a {{{}UnaryNode{}}}, the existing checks do not traverse it.
> Result: Attempting time travel over a non-existent table produces a Spark 
> internal error from the [default 
> case|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L818],
>  rather than the expected {{{}AnalysisException{}}}:
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the 
> unresolved operator: 'RelationTimeTravel 'UnresolvedRelation [not_exists], 
> [], false, 0
> [info]   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:77)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$54(CheckAnalysis.scala:753)
>  {code}
> Fix should be simple enough:
> {code:java}
> case tt: RelationTimeTravel =>
>   checkAnalysis0(tt.table) {code}






[jira] [Created] (SPARK-45383) Missing case for RelationTimeTravel in CheckAnalysis

2023-09-29 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-45383:


 Summary: Missing case for RelationTimeTravel in CheckAnalysis
 Key: SPARK-45383
 URL: https://issues.apache.org/jira/browse/SPARK-45383
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Ryan Johnson


{{CheckAnalysis.checkAnalysis0}} lacks a case for {{{}RelationTimeTravel{}}}, 
and since the latter is (intentionally) an {{UnresolvedLeafNode}} rather than a 
{{{}UnaryNode{}}}, the existing checks do not traverse it.

Result: Attempting time travel over a non-existent table produces a Spark 
internal error from the [default 
case|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L818],
 rather than the expected {{{}AnalysisException{}}}:
{code:java}
[info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the 
unresolved operator: 'RelationTimeTravel 'UnresolvedRelation [not_exists], [], 
false, 0
[info]   at 
org.apache.spark.SparkException$.internalError(SparkException.scala:77)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$54(CheckAnalysis.scala:753)
 {code}
Solution should be simple enough:
{code:java}
case tt: RelationTimeTravel =>
  checkAnalysis0(tt.table) {code}






[jira] [Updated] (SPARK-45319) SHOW COLUMNS namespace is not catalog-aware

2023-09-25 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-45319:
-
Description: 
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails
{code}
There is also some ambiguity in the semantics of the check:
{code:java}
SHOW COLUMNS FOR a.b.table IN a -- fails today (should it succeed?) 
SHOW COLUMNS FOR a.b.table IN b -- succeeds today (should it fail?) {code}
It might work better to use an actual {{Identifier}} (instead of 
{{{}TableIdentifier{}}}), and compare its {{namespace}} against the 
user-provided namespace?
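
A rough sketch of that idea, assuming a catalyst {{Identifier}} (whose {{namespace()}} carries the full qualifier) is available at that point in the rule; the helper name is made up for illustration:
{code:java}
import org.apache.spark.sql.connector.catalog.Identifier

// Hypothetical helper: compare the user-supplied namespace against the table's
// full namespace, element by element, using the session's resolver.
def namespaceMatches(
    tableIdent: Identifier,
    userNs: Seq[String],
    resolver: (String, String) => Boolean): Boolean = {
  val tableNs = tableIdent.namespace().toSeq
  userNs.length == tableNs.length &&
    userNs.zip(tableNs).forall { case (u, t) => resolver(u, t) }
}
{code}
With something like that, {{SHOW COLUMNS FOR a.b.table IN a.b}} would match, and partial namespaces like {{IN a}} or {{IN b}} could be accepted or rejected deliberately rather than by accident.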

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?

  was:
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails
{code}
There is also some ambiguity in the semantics of the check:
{code:java}
SHOW COLUMNS FOR a.b.table IN a -- fails today (should it succeed?) 
SHOW COLUMNS FOR a.b.table IN b -- succeeds today (should it fail?) {code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?


> SHOW COLUMNS namespace is not catalog-aware 
> 
>
> Key: SPARK-45319
> URL: https://issues.apache.org/jira/browse/SPARK-45319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
> catalog-aware. 
> [AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
>  correctly pulls in and applies namespaces with more than one element, but 
> [ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
>  does not correctly validate a multi-element namespace against the qualified 
> table name:
> {code:java}
>     case ShowColumns(Reso

[jira] [Updated] (SPARK-45319) SHOW COLUMNS namespace is not catalog-aware

2023-09-25 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-45319:
-
Description: 
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails
{code}
There is also ambiguity:
{code:java}
SHOW COLUMNS FOR a.b.table IN a -- fails today (should it succeed?) 
SHOW COLUMNS FOR a.b.table IN b -- succeeds today (should it fail?) {code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?

  was:
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails
SHOW COLUMNS FOR a.b.table IN a -- fails today (should it succeed?)
SHOW COLUMNS FOR a.b.table IN b -- succeeds today (should it fail?){code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?


> SHOW COLUMNS namespace is not catalog-aware 
> 
>
> Key: SPARK-45319
> URL: https://issues.apache.org/jira/browse/SPARK-45319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
> catalog-aware. 
> [AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
>  correctly pulls in and applies namespaces with more than one element, but 
> [ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
>  does not correctly validate a multi-element namespace against the qualified 
> table name:
> {code:java}
>     case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
>       val v1TableName = ident
>       val resolver = conf.resolver
>       val db = ns match {
>      

[jira] [Updated] (SPARK-45319) SHOW COLUMNS namespace is not catalog-aware

2023-09-25 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-45319:
-
Description: 
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails
{code}
There is also some ambiguity in the semantics of the check:
{code:java}
SHOW COLUMNS FOR a.b.table IN a -- fails today (should it succeed?) 
SHOW COLUMNS FOR a.b.table IN b -- succeeds today (should it fail?) {code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?

  was:
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails
{code}
There is also ambiguity:
{code:java}
SHOW COLUMNS FOR a.b.table IN a -- fails today (should it succeed?) 
SHOW COLUMNS FOR a.b.table IN b -- succeeds today (should it fail?) {code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?


> SHOW COLUMNS namespace is not catalog-aware 
> 
>
> Key: SPARK-45319
> URL: https://issues.apache.org/jira/browse/SPARK-45319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
> catalog-aware. 
> [AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
>  correctly pulls in and applies namespaces with more than one element, but 
> [ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
>  does not correctly validate a multi-element namespace against the qualified 
> table name:
> {code:java}
>     case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
>       val v1TableName

[jira] [Updated] (SPARK-45319) SHOW COLUMNS namespace is not catalog-aware

2023-09-25 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-45319:
-
Description: 
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails
SHOW COLUMNS FOR a.b.table IN a -- fails today (should it succeed?)
SHOW COLUMNS FOR a.b.table IN b -- succeeds today (should it fail?){code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?

  was:
Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a -- incorrectly succeeds
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails{code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?


> SHOW COLUMNS namespace is not catalog-aware 
> 
>
> Key: SPARK-45319
> URL: https://issues.apache.org/jira/browse/SPARK-45319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
> catalog-aware. 
> [AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
>  correctly pulls in and applies namespaces with more than one element, but 
> [ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
>  does not correctly validate a multi-element namespace against the qualified 
> table name:
> {code:java}
>     case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
>       val v1TableName = ident
>       val resolver = conf.resolver
>       val db = ns match {
>         case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
>           throw 
> QueryCompilationErrors.showColu

[jira] [Created] (SPARK-45319) SHOW COLUMNS namespace is not catalog-aware

2023-09-25 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-45319:


 Summary: SHOW COLUMNS namespace is not catalog-aware 
 Key: SPARK-45319
 URL: https://issues.apache.org/jira/browse/SPARK-45319
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Ryan Johnson


Today, the namespace argument of the {{SHOW COLUMNS}} command is only partly 
catalog-aware. 
[AstBuilder.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L4693]
 correctly pulls in and applies namespaces with more than one element, but 
[ResolveSessionCatalog.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L316]
 does not correctly validate a multi-element namespace against the qualified 
table name:
{code:java}
    case ShowColumns(ResolvedV1TableOrViewIdentifier(ident), ns, output) =>
      val v1TableName = ident
      val resolver = conf.resolver
      val db = ns match {
        case Some(db) if v1TableName.database.exists(!resolver(_, db.head)) =>
          throw QueryCompilationErrors.showColumnsWithConflictDatabasesError(db, v1TableName)
        case _ => ns.map(_.head)
      } {code}
By always/only checking {{db.head}} against {{{}v1TableName.database{}}}, it 
will not correctly handle e.g.
{code:java}
SHOW COLUMNS FOR a.b.table IN a -- incorrectly succeeds
SHOW COLUMNS FOR a.b.table IN a.b -- incorrectly fails{code}
A better solution would use an actual {{Identifier}} and compare its 
{{namespace}} against the user-provided namespace.

Tangentially: It's arguably a bit strange for the validity check to reside in the 
{{ResolveSessionCatalog}} rule, given that it doesn't actually have anything to 
do with session catalogs? This seems more an artifact of an implementation that 
searches specifically for V1 tables, and {{ResolveSessionCatalog}} providing 
matchers that make that job easier?






[jira] [Updated] (SPARK-44081) Simplify PartitionedFileUtil API a little

2023-06-16 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-44081:
-
Summary: Simplify PartitionedFileUtil API a little  (was: Simplify 
PartitionedFileUtil API)

> Simplify PartitionedFileUtil API a little
> -
>
> Key: SPARK-44081
> URL: https://issues.apache.org/jira/browse/SPARK-44081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Utility methods in {{PartitionedFileUtil}} take both {{file: 
> FileStatusWithMetadata}} and {{{}filePath: Path{}}}, even though the latter can 
> be obtained easily from the former.
> This was originally done for performance reasons, to allow callers to pass a 
> memoized {{Path}} and thus avoid the cost of an additional 
> {{FileStatus.getPath}} call.
> Now that we have {{{}FileStatusWithMetadata{}}}, we no longer need the 
> redundancy because the new class can capture the path as a {{{}lazy val{}}}.






[jira] [Created] (SPARK-44081) Simplify PartitionedFileUtil API

2023-06-16 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-44081:


 Summary: Simplify PartitionedFileUtil API
 Key: SPARK-44081
 URL: https://issues.apache.org/jira/browse/SPARK-44081
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Ryan Johnson


Utility methods in {{PartitionedFileUtil}} take both {{file: 
FileStatusWithMetadata}} and {{{}filePath: Path{}}}, even though the latter can be 
obtained easily from the former.

This was originally done for performance reasons, to allow callers to pass a 
memoized {{Path}} and thus avoid the cost of an additional 
{{FileStatus.getPath}} call.

Now that we have {{{}FileStatusWithMetadata{}}}, we no longer need the 
redundancy because the new class can capture the path as a {{{}lazy val{}}}.
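
A rough sketch of the shape this could take (field names here are assumptions, not the final API):
{code:java}
import org.apache.hadoop.fs.{FileStatus, Path}

// Sketch: the wrapper memoizes the path itself, so PartitionedFileUtil methods can
// drop the separate `filePath: Path` parameter and just use `file.getPath`.
case class FileStatusWithMetadata(fileStatus: FileStatus, metadata: Map[String, Any] = Map.empty) {
  lazy val getPath: Path = fileStatus.getPath
  def getLen: Long = fileStatus.getLen
}
{code}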






[jira] [Created] (SPARK-44071) Define UnresolvedNode trait to reduce redundancy

2023-06-15 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-44071:


 Summary: Define UnresolvedNode trait to reduce redundancy
 Key: SPARK-44071
 URL: https://issues.apache.org/jira/browse/SPARK-44071
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Ryan Johnson


Looking at 
[unresolved.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala],
 Spark would benefit from an {{UnresolvedNode}} trait that various 
{{UnresolvedFoo}} classes could inherit from:
{code:java}
trait UnresolvedNode extends LogicalPlan {
  override def output: Seq[Attribute] = Nil
  override lazy val resolved = false
}{code}
Today, the code is duplicated in ~20 locations.
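
A hypothetical adopter (the node name is invented for illustration) would then just mix the trait in:
{code:java}
// Sketch: an unresolved placeholder node no longer re-declares output/resolved.
case class UnresolvedWidget(multipartIdentifier: Seq[String])
  extends LeafNode with UnresolvedNode
{code}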






[jira] [Created] (SPARK-43226) Define extractors for file-constant metadata columns

2023-04-20 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-43226:


 Summary: Define extractors for file-constant metadata columns
 Key: SPARK-43226
 URL: https://issues.apache.org/jira/browse/SPARK-43226
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Ryan Johnson


File-source constant metadata columns are often derived indirectly from 
file-level metadata values rather than exposing those values directly. For 
example, {{_metadata.file_name}} is currently hard-coded in 
{{FileFormat.updateMetadataInternalRow}} as:

 
{code:java}
UTF8String.fromString(filePath.getName){code}
 

We should add support for metadata extractors, functions that map from 
{{PartitionedFile}} to {{{}Literal{}}}, so that we can express such columns in 
a generic way instead of hard-coding them.

We can't just add them to the metadata map because then they have to be 
pre-computed even if it turns out the query does not select that field.
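
A minimal sketch of the idea (the type alias and extractor below are hypothetical, not an existing API):
{code:java}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.execution.datasources.PartitionedFile

// An extractor is just a function from the scanned file to the Literal that should
// populate the field; it only runs if the query actually selects that field.
type FileMetadataExtractor = PartitionedFile => Literal

// Example: derive _metadata.file_name generically instead of hard-coding it
// (assumes PartitionedFile.toPath, available in Spark 3.4+).
val fileNameExtractor: FileMetadataExtractor =
  pf => Literal(pf.toPath.getName)
{code}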






[jira] [Created] (SPARK-43039) Support custom fields in the file source _metadata column

2023-04-05 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-43039:


 Summary: Support custom fields in the file source _metadata column
 Key: SPARK-43039
 URL: https://issues.apache.org/jira/browse/SPARK-43039
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Ryan Johnson


Today, the schema of the file source _metadata column depends on the file 
format (e.g. the Parquet file format supports {{{}_metadata.row_index{}}}), but this 
is hard-wired into the {{FileFormat}} itself. Not only is this an ugly design, 
but it also prevents custom file formats from adding their own fields to the 
{{_metadata}} column.






[jira] [Updated] (SPARK-42704) SubqueryAlias should propagate metadata columns its child already selects

2023-03-07 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-42704:
-
Description: 
The `AddMetadataColumns` analyzer rule intends to resolve available 
metadata columns, even if the plan already contains projections that did not 
explicitly mention the metadata column.

The `SubqueryAlias` plan node intentionally does not propagate metadata columns 
automatically from a non-leaf/non-subquery child node, because the following 
should _not_ work:

 
{code:java}
spark.read.table("t").select("a", "b").as("s").select("_metadata"){code}
However, today it is too strict and breaks the metadata chain when the child 
node's output already includes the metadata column:

 
{code:java}
// expected to work (and does)
spark.read.table("t")
  .select("a", "b").select("_metadata")

// by extension, should also work (but does not)
spark.read.table("t").select("a", "b", "_metadata").as("s")
  .select("a", "b").select("_metadata"){code}
The solution is for `SubqueryAlias` to always propagate metadata columns that 
are already in the child's output, thus preserving the `metadataOutput` chain 
for that column.
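
A rough sketch of that behavior as it might look inside `SubqueryAlias` (an illustration of the idea, not the actual patch; it assumes the `MetadataColumnHelper` implicit from catalyst util is in scope for `isMetadataCol`):
{code:java}
// Inside SubqueryAlias (illustration only):
override def metadataOutput: Seq[Attribute] = {
  val qualifierList = identifier.qualifier :+ alias
  // Existing rule: only leaf (or nested alias) children propagate metadata implicitly.
  val implicitlyPropagated = child match {
    case _: LeafNode | _: SubqueryAlias => child.metadataOutput
    case _ => Nil
  }
  // Proposed addition: metadata columns the child already selects stay visible.
  val alreadySelected = child.output.filter(_.isMetadataCol)
  (implicitlyPropagated ++ alreadySelected).distinct.map(_.withQualifier(qualifierList))
}
{code}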

  was:
The `AddMetadataColumns` analyzer rule intends to resolve available 
metadata columns, even if the plan already contains projections that did not 
explicitly mention the metadata column.

The `SubqueryAlias` plan node intentionally does not propagate metadata columns 
automatically from a non-leaf/non-subquery child node, because the following 
should _not_ work:

 
{code:java}
spark.read.table("t").select("a", "b").as("s").select("_metadata"){code}
However, today it is too strict and breaks the metadata chain when the child 
node's output already includes the metadata column:

 
{code:java}
// expected to work
spark.read.table("t")
  .select("a", "b").select("_metadata")

// by extension, should also work (but does not)
spark.read.table("t").select("a", "b", "_metadata").as("s")
  .select("a", "b").select("_metadata"){code}
The solution is for `SubqueryAlias` to always propagate metadata columns that 
are already in the child's output, thus preserving the `metadataOutput` chain 
for that column.


> SubqueryAlias should propagate metadata columns its child already selects 
> --
>
> Key: SPARK-42704
> URL: https://issues.apache.org/jira/browse/SPARK-42704
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Ryan Johnson
>Priority: Major
>
> The `AddMetadataColumns` analyzer rule intends to resolve available 
> metadata columns, even if the plan already contains projections that did not 
> explicitly mention the metadata column.
> The `SubqueryAlias` plan node intentionally does not propagate metadata 
> columns automatically from a non-leaf/non-subquery child node, because the 
> following should _not_ work:
>  
> {code:java}
> spark.read.table("t").select("a", "b").as("s").select("_metadata"){code}
> However, today it is too strict and breaks the metadata chain when the 
> child node's output already includes the metadata column:
>  
> {code:java}
> // expected to work (and does)
> spark.read.table("t")
>   .select("a", "b").select("_metadata")
> // by extension, should also work (but does not)
> spark.read.table("t").select("a", "b", "_metadata").as("s")
>   .select("a", "b").select("_metadata"){code}
> The solution is for `SubqueryAlias` to always propagate metadata columns that 
> are already in the child's output, thus preserving the `metadataOutput` chain 
> for that column.






[jira] [Created] (SPARK-42704) SubqueryAlias should propagate metadata columns its child already selects

2023-03-07 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-42704:


 Summary: SubqueryAlias should propagate metadata columns its child 
already selects 
 Key: SPARK-42704
 URL: https://issues.apache.org/jira/browse/SPARK-42704
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.2, 3.4.0
Reporter: Ryan Johnson


The `AddMetadataColumns` analyzer rule intends to resolve available 
metadata columns, even if the plan already contains projections that did not 
explicitly mention the metadata column.

The `SubqueryAlias` plan node intentionally does not propagate metadata columns 
automatically from a non-leaf/non-subquery child node, because the following 
should _not_ work:

 
{code:java}
spark.read.table("t").select("a", "b").as("s").select("_metadata"){code}
However, today it is too strict and breaks the metadata chain when the child 
node's output already includes the metadata column:

 
{code:java}
// expected to work
spark.read.table("t")
  .select("a", "b").select("_metadata")

// by extension, should also work (but does not)
spark.read.table("t").select("a", "b", "_metadata").as("s")
  .select("a", "b").select("_metadata"){code}
The solution is for `SubqueryAlias` to always propagate metadata columns that 
are already in the child's output, thus preserving the `metadataOutput` chain 
for that column.






[jira] [Updated] (SPARK-42683) Automatically rename metadata columns that conflict with data schema columns

2023-03-07 Thread Ryan Johnson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Johnson updated SPARK-42683:
-
Target Version/s: 3.5.0  (was: 3.4.0)

> Automatically rename metadata columns that conflict with data schema columns
> 
>
> Key: SPARK-42683
> URL: https://issues.apache.org/jira/browse/SPARK-42683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Ryan Johnson
>Priority: Major
>
> Today, if a datasource already has a column called `_metadata`, queries 
> cannot access the file-source metadata column that normally carries that 
> name. We can address this conflict with two changes to metadata column 
> handling:
>  # Automatically rename any metadata column whose name conflicts with a data 
> schema column
>  # Add a facility to reliably find metadata columns by their original/logical 
> name, even if they were renamed.






[jira] [Created] (SPARK-42683) Automatically rename metadata columns that conflict with data schema columns

2023-03-06 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-42683:


 Summary: Automatically rename metadata columns that conflict with 
data schema columns
 Key: SPARK-42683
 URL: https://issues.apache.org/jira/browse/SPARK-42683
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.2, 3.4.0
Reporter: Ryan Johnson


Today, if a datasource already has a column called `_metadata`, queries cannot 
access the file-source metadata column that normally carries that name. We can 
address this conflict with two changes to metadata column handling:
 # Automatically rename any metadata column whose name conflicts with a data 
schema column (see the sketch below)
 # Add a facility to reliably find metadata columns by their original/logical 
name, even if they were renamed.
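
A minimal sketch of the renaming in point 1, under the assumption that we simply prepend underscores until the name is free (the helper is hypothetical):
{code:java}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: pick a non-conflicting physical name for a metadata column.
// The original logical name would still be recorded so it can be looked up later
// (point 2 above).
def deconflictMetadataName(logicalName: String, dataSchema: StructType): String = {
  var name = logicalName
  while (dataSchema.fieldNames.exists(_.equalsIgnoreCase(name))) {
    name = "_" + name // e.g. _metadata -> __metadata
  }
  name
}
{code}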






[jira] [Created] (SPARK-40141) Task listener overloads no longer needed with JDK 8+

2022-08-18 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-40141:


 Summary: Task listener overloads no longer needed with JDK 8+
 Key: SPARK-40141
 URL: https://issues.apache.org/jira/browse/SPARK-40141
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Ryan Johnson


TaskContext defines methods for registering completion and failure listeners, 
and the respective listener types qualify as functional interfaces in JDK 8+. 
This leads to awkward ambiguous-overload errors with the overload of each 
method that takes a function directly instead of a listener. Now that JDK 8 
is the minimum allowed, we can remove the unnecessary overloads, which not only 
simplifies the code, but also removes a source of frustration since it can be 
nearly impossible to predict when an ambiguous overload might be triggered.
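
A sketch of the kind of call that can trip the ambiguity (whether it actually fails depends on the Scala version and inference context):
{code:java}
import org.apache.spark.TaskContext

val ctx = TaskContext.get()
// With SAM conversion, a bare lambda can satisfy both the TaskCompletionListener
// overload and the (TaskContext => Unit) overload, so the compiler may report an
// ambiguous overload unless the intended type is pinned explicitly:
ctx.addTaskCompletionListener[Unit] { _ => /* cleanup */ }
{code}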






[jira] [Created] (SPARK-40106) Task failure handlers should always run if the task failed

2022-08-16 Thread Ryan Johnson (Jira)
Ryan Johnson created SPARK-40106:


 Summary: Task failure handlers should always run if the task failed
 Key: SPARK-40106
 URL: https://issues.apache.org/jira/browse/SPARK-40106
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Ryan Johnson


Today, if a task body succeeds, but a task completion listener fails, task 
failure listeners are not called -- even though the task has indeed failed at that 
point.

If a completion listener fails, and failure listeners were not previously 
invoked, we should invoke them before running the remaining completion 
listeners.

Such a change would increase the utility of task listeners, especially ones 
intended to assist with task cleanup. 

To give one arbitrary example, code like this appears at several places in the 
code (taken from the {{executeTask}} method of FileFormatWriter.scala):
{code:java}
    try {
      Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
        // Execute the task to write rows out and commit the task.
        dataWriter.writeWithIterator(iterator)
        dataWriter.commit()
      })(catchBlock = {
        // If there is an error, abort the task
        dataWriter.abort()
        logError(s"Job $jobId aborted.")
      }, finallyBlock = {
        dataWriter.close()
      })
    } catch {
      case e: FetchFailedException =>
        throw e
      case f: FileAlreadyExistsException if SQLConf.get.fastFailFileFormatOutput =>
        // If any output file to write already exists, it does not make sense to re-run this task.
        // We throw the exception and let Executor throw ExceptionFailure to abort the job.
        throw new TaskOutputFileAlreadyExistException(f)
      case t: Throwable =>
        throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t)
    }{code}
If failure listeners were reliably called, the above idiom could potentially be 
factored out as two failure listeners plus a completion listener, and reused 
rather than duplicated.
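
Roughly, the factored-out version might look like this (a sketch assuming the reliable-failure-listener semantics this issue proposes; `dataWriter`, `iterator`, and `jobId` are from the snippet above):
{code:java}
val ctx = TaskContext.get()
// Failure listeners: run whenever the task fails, including (with this change)
// failures raised by completion listeners after the task body succeeded.
ctx.addTaskFailureListener { (_, _) => dataWriter.abort() }
ctx.addTaskFailureListener { (_, _) => logError(s"Job $jobId aborted.") }
// Completion listener: always close the writer.
ctx.addTaskCompletionListener[Unit] { _ => dataWriter.close() }

dataWriter.writeWithIterator(iterator)
dataWriter.commit()
{code}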


