[jira] [Updated] (SPARK-43258) Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43258:
---
Labels: pull-request-available starter  (was: starter)

> Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]
> --
>
> Key: SPARK-43258
> URL: https://issues.apache.org/jira/browse/SPARK-43258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2023* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't exist 
> yet. Check exception fields by using {*}checkError(){*}. That function checks 
> only the valuable error fields and avoids depending on the error text message, 
> so tech editors can modify the error format in error-classes.json without 
> worrying about Spark's internal tests. Migrate other tests that might trigger 
> the error onto checkError().
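> As an illustration, a test of the kind described above might look roughly like 
> this (a sketch assuming a suite extending QueryTest with SharedSparkSession; the 
> error class name, query and parameters are placeholders, not the real ones):
> {code:java}
> test("renamed error class is raised from user code") {
>   checkError(
>     exception = intercept[SparkException] {
>       sql("SELECT ...").collect()   // hypothetical query that triggers the error
>     },
>     errorClass = "SOME_NEW_ERROR_CLASS_NAME",
>     parameters = Map("someParam" -> "someValue"))
> }
> {code}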
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose to users how to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]






[jira] [Updated] (SPARK-43258) Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]

2023-11-10 Thread Deng Ziming (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deng Ziming updated SPARK-43258:

Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]  
(was: Assign a name to the error class _LEGACY_ERROR_TEMP_2023)

> Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]
> --
>
> Key: SPARK-43258
> URL: https://issues.apache.org/jira/browse/SPARK-43258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2023* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't exist 
> yet. Check exception fields by using {*}checkError(){*}. That function checks 
> only the valuable error fields and avoids depending on the error text message, 
> so tech editors can modify the error format in error-classes.json without 
> worrying about Spark's internal tests. Migrate other tests that might trigger 
> the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not 
> clear. Propose to users how to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]






[jira] [Created] (SPARK-45893) Support drop multiple partitions in batch for hive

2023-11-10 Thread Wechar (Jira)
Wechar created SPARK-45893:
--

 Summary: Support drop multiple partitions in batch for hive
 Key: SPARK-45893
 URL: https://issues.apache.org/jira/browse/SPARK-45893
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Wechar


Support dropping partitions in batch to improve performance.






[jira] [Commented] (SPARK-45869) Revisit and Improve Spark Standalone Cluster

2023-11-10 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785148#comment-17785148
 ] 

Kent Yao commented on SPARK-45869:
--

Thank you [~dongjoon] 

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 4.0.0
>
>
> This is an experimental internal configuration for advanced users.






[jira] [Assigned] (SPARK-45686) Fix `method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated`

2023-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-45686:


Assignee: Yang Jie

> Fix `method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated`
> 
>
> Key: SPARK-45686
> URL: https://issues.apache.org/jira/browse/SPARK-45686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:57:31:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(s1.indices, s1.values, s2.indices, 
> s2.values)
> [error]                               ^
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:57:54:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(s1.indices, s1.values, s2.indices, 
> s2.values)
> [error]                                                      ^
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:59:31:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(s1.indices, s1.values, 0 until d1.size, 
> d1.values)
> [error]                               ^
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:61:59:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(0 until d1.size, d1.values, s1.indices, 
> s1.values)
> [error]  {code}
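> A minimal illustration of the two replacements the deprecation message suggests 
> (a plain Array[Double] is assumed here, not the actual Vectors.scala code):
> {code:java}
> import scala.collection.immutable.ArraySeq
> 
> val values: Array[Double] = Array(1.0, 2.0, 3.0)
> 
> // Explicit copy instead of the deprecated implicit conversion:
> val copied: IndexedSeq[Double] = values.toIndexedSeq
> 
> // Non-copying wrapper, safe as long as the array is not mutated afterwards:
> val wrapped: IndexedSeq[Double] = ArraySeq.unsafeWrapArray(values)
> {code}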






[jira] [Resolved] (SPARK-45686) Fix `method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated`

2023-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-45686.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43670
[https://github.com/apache/spark/pull/43670]

> Fix `method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated`
> 
>
> Key: SPARK-45686
> URL: https://issues.apache.org/jira/browse/SPARK-45686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:57:31:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(s1.indices, s1.values, s2.indices, 
> s2.values)
> [error]                               ^
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:57:54:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(s1.indices, s1.values, s2.indices, 
> s2.values)
> [error]                                                      ^
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:59:31:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(s1.indices, s1.values, 0 until d1.size, 
> d1.values)
> [error]                               ^
> [error] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:61:59:
>  method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is 
> deprecated (since 2.13.0): implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; use `toIndexedSeq` 
> explicitly if you want to copy, or use the more efficient non-copying 
> ArraySeq.unsafeWrapArray
> [error] Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.ml.linalg.Vector.equals, 
> origin=scala.LowPriorityImplicits2.copyArrayToImmutableIndexedSeq, 
> version=2.13.0
> [error]             Vectors.equals(0 until d1.size, d1.values, s1.indices, 
> s1.values)
> [error]  {code}






[jira] [Updated] (SPARK-40129) Decimal multiply can produce the wrong answer because it rounds twice

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-40129:
---
Labels: pull-request-available  (was: )

> Decimal multiply can produce the wrong answer because it rounds twice
> -
>
> Key: SPARK-40129
> URL: https://issues.apache.org/jira/browse/SPARK-40129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
>  Labels: pull-request-available
>
> This looks like it has been around for a long time, but I have reproduced it 
> in 3.2.0+
> The example here is multiplying Decimal(38, 10) by another Decimal(38, 10), 
> but I think it can be reproduced with other number combinations, and possibly 
> with divide too.
> {code:java}
> Seq("9173594185998001607642838421.5479932913").toDF.selectExpr("CAST(value as 
> DECIMAL(38,10)) as a").selectExpr("a * CAST(-12 as 
> DECIMAL(38,10))").show(truncate=false)
> {code}
> This produces an answer in Spark of 
> {{-110083130231976019291714061058.575920}} But if I do the calculation in 
> regular java BigDecimal I get {{-110083130231976019291714061058.575919}}
> {code:java}
> BigDecimal l = new BigDecimal("9173594185998001607642838421.5479932913");
> BigDecimal r = new BigDecimal("-12.00");
> BigDecimal prod = l.multiply(r);
> BigDecimal rounded_prod = prod.setScale(6, RoundingMode.HALF_UP);
> {code}
> Spark does essentially all of the same operations, but it uses Decimal to do 
> it instead of Java's BigDecimal directly. Spark, by way of Decimal, will set 
> a MathContext for the multiply operation that has a max precision of 38 and 
> will do half up rounding. That means that the result of the multiply 
> operation in Spark is {{{}-110083130231976019291714061058.57591950{}}}, but 
> for the java BigDecimal code the result is 
> {{{}-110083130231976019291714061058.575919495600{}}}. Then in 
> CheckOverflow for 3.2.0 and 3.3.0 or in just the regular Multiply expression 
> in 3.4.0 the setScale is called (as a part of Decimal.setPrecision). At that 
> point the already rounded number is rounded yet again resulting in what is 
> arguably a wrong answer by Spark.
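> The double rounding can be reproduced with plain BigDecimal (a sketch of the 
> behaviour described above, not Spark's actual code path):
> {code:java}
> import java.math.{MathContext, RoundingMode, BigDecimal => JBigDecimal}
> 
> val l = new JBigDecimal("9173594185998001607642838421.5479932913")
> val r = new JBigDecimal("-12.00")
> 
> // Single rounding: exact product, then set the scale to 6 once.
> val once = l.multiply(r).setScale(6, RoundingMode.HALF_UP)
> // -110083130231976019291714061058.575919
> 
> // Double rounding: limit to 38 significant digits (HALF_UP) first, then set the scale.
> val mc = new MathContext(38, RoundingMode.HALF_UP)
> val twice = l.multiply(r, mc).setScale(6, RoundingMode.HALF_UP)
> // -110083130231976019291714061058.575920
> {code}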
> I have not fully tested this, but it looks like we could just remove the 
> MathContext entirely in Decimal, or set it to UNLIMITED. All of the decimal 
> operations appear to have their own overflow and rounding anyway. If we want 
> to potentially reduce the total memory usage, we could also set the max 
> precision to 39 and truncate (round down) the result in the math context 
> instead.  That would then let us round the result correctly in setPrecision 
> afterwards.






[jira] [Assigned] (SPARK-45877) ExecutorFailureTracker support for standalone mode

2023-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45877:
-

Assignee: Kent Yao

> ExecutorFailureTracker support for standalone mode
> --
>
> Key: SPARK-45877
> URL: https://issues.apache.org/jira/browse/SPARK-45877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> ExecutorFailureTracker now works for K8s and YARN; I guess it is also an 
> important feature for standalone mode to have.






[jira] [Commented] (SPARK-45892) Refactor the optimizer plan validation

2023-11-10 Thread Xi Liang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785090#comment-17785090
 ] 

Xi Liang commented on SPARK-45892:
--

cc [~cloud_fan] here's the PR [https://github.com/apache/spark/pull/43761] 
PTAL, thanks!

> Refactor the optimizer plan validation
> --
>
> Key: SPARK-45892
> URL: https://issues.apache.org/jira/browse/SPARK-45892
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Xi Liang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the expressionIDUniqueness validation is closely coupled with 
> output schema validation. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L403C7-L411C8
> Some refactoring can improve readability and reuse.






[jira] [Updated] (SPARK-45892) Refactor the optimizer plan validation

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45892:
---
Labels: pull-request-available  (was: )

> Refactor the optimizer plan validation
> --
>
> Key: SPARK-45892
> URL: https://issues.apache.org/jira/browse/SPARK-45892
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Xi Liang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the expressionIDUniqueness validation is closely coupled with 
> output schema validation. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L403C7-L411C8
> Some refactoring can improve readability and reuse.






[jira] [Created] (SPARK-45892) Refactor the optimizer plan validation

2023-11-10 Thread Xi Liang (Jira)
Xi Liang created SPARK-45892:


 Summary: Refactor the optimizer plan validation
 Key: SPARK-45892
 URL: https://issues.apache.org/jira/browse/SPARK-45892
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.4.1, 3.4.0
Reporter: Xi Liang


Currently, the expressionIDUniqueness validation is closely coupled with output 
schema validation. 

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L403C7-L411C8

Some refactoring can improve readability and reuse.






[jira] [Updated] (SPARK-45731) Update partition statistics with ANALYZE TABLE command

2023-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45731:
--
Affects Version/s: 4.0.0
   (was: 3.5.0)

> Update partition statistics with ANALYZE TABLE command
> --
>
> Key: SPARK-45731
> URL: https://issues.apache.org/jira/browse/SPARK-45731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently the {{ANALYZE TABLE}} command only updates table-level stats but not 
> partition stats, even though it can be applied to both non-partitioned and 
> partitioned tables. It seems to make sense for it to update partition stats as 
> well.
> Note that users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, 
> but the syntax is more verbose as they need to specify all the partition 
> columns.
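> For illustration, the two commands being compared are roughly (table and 
> partition column names are hypothetical):
> {code:java}
> // Today this only refreshes table-level statistics:
> spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
> 
> // Workaround for partition-level statistics: the partition columns
> // have to be spelled out explicitly.
> spark.sql("ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS")
> {code}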






[jira] [Updated] (SPARK-45827) Add Variant data type in Spark

2023-11-10 Thread Chenhao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenhao Li updated SPARK-45827:
---
Summary: Add Variant data type in Spark  (was: Add variant data type in 
Spark)

> Add Variant data type in Spark
> --
>
> Key: SPARK-45827
> URL: https://issues.apache.org/jira/browse/SPARK-45827
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45891) Support Variant data type

2023-11-10 Thread Chenhao Li (Jira)
Chenhao Li created SPARK-45891:
--

 Summary: Support Variant data type
 Key: SPARK-45891
 URL: https://issues.apache.org/jira/browse/SPARK-45891
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Chenhao Li









[jira] [Updated] (SPARK-45884) Upgrade ORC to 1.8.6

2023-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45884:
--
Affects Version/s: 3.4.1
   (was: 3.4.2)

> Upgrade ORC to 1.8.6
> 
>
> Key: SPARK-45884
> URL: https://issues.apache.org/jira/browse/SPARK-45884
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2
>
>







[jira] [Resolved] (SPARK-45884) Upgrade ORC to 1.8.6

2023-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45884.
---
Fix Version/s: 3.4.2
   Resolution: Fixed

Issue resolved by pull request 43755
[https://github.com/apache/spark/pull/43755]

> Upgrade ORC to 1.8.6
> 
>
> Key: SPARK-45884
> URL: https://issues.apache.org/jira/browse/SPARK-45884
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2
>
>







[jira] [Resolved] (SPARK-45885) Upgrade ORC to 1.7.10

2023-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45885.
---
Fix Version/s: 3.3.4
   Resolution: Fixed

Issue resolved by pull request 43756
[https://github.com/apache/spark/pull/43756]

> Upgrade ORC to 1.7.10
> -
>
> Key: SPARK-45885
> URL: https://issues.apache.org/jira/browse/SPARK-45885
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.3
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.4
>
>







[jira] [Resolved] (SPARK-45883) Upgrade ORC to 1.9.2

2023-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45883.
---
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43754
[https://github.com/apache/spark/pull/43754]

> Upgrade ORC to 1.9.2
> 
>
> Key: SPARK-45883
> URL: https://issues.apache.org/jira/browse/SPARK-45883
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.5.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>







[jira] [Updated] (SPARK-45889) Implement push-down filter with partition ID and grouping key (if possible) for state data source reader

2023-11-10 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-45889:
-
Summary: Implement push-down filter with partition ID and grouping key (if 
possible) for state data source reader  (was: Implement push-down filter with 
partition ID and grouping key (if possible))

> Implement push-down filter with partition ID and grouping key (if possible) 
> for state data source reader
> 
>
> Key: SPARK-45889
> URL: https://issues.apache.org/jira/browse/SPARK-45889
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> If the query filters the state data via partition ID, it is a good chance for 
> the state data source to avoid spinning up all state store instances and 
> wasting resources. We can spin up state store instances for only the necessary 
> partitions.
> The same applies to grouping keys, although the criterion on distribution is 
> bound to the operator rather than to the key in the state store, so it could be 
> very tricky unless we can follow the same distribution criterion as the 
> operator.






[jira] [Created] (SPARK-45890) Implement limit push down for state data source reader

2023-11-10 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-45890:


 Summary: Implement limit push down for state data source reader
 Key: SPARK-45890
 URL: https://issues.apache.org/jira/browse/SPARK-45890
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jungtaek Lim


Implement limit push down to optimize the query for sample data read.






[jira] [Created] (SPARK-45889) Implement push-down filter with partition ID and grouping key (if possible)

2023-11-10 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-45889:


 Summary: Implement push-down filter with partition ID and grouping 
key (if possible)
 Key: SPARK-45889
 URL: https://issues.apache.org/jira/browse/SPARK-45889
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jungtaek Lim


If the query filters the state data via partition ID, it is a good chance for 
the state data source to avoid spinning up all state store instances and wasting 
resources. We can spin up state store instances for only the necessary partitions.

The same applies to grouping keys, although the criterion on distribution is 
bound to the operator rather than to the key in the state store, so it could be 
very tricky unless we can follow the same distribution criterion as the operator.






[jira] [Created] (SPARK-45888) Apply error class framework to state data source & state metadata data source

2023-11-10 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-45888:


 Summary: Apply error class framework to state data source & state 
metadata data source
 Key: SPARK-45888
 URL: https://issues.apache.org/jira/browse/SPARK-45888
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jungtaek Lim


Intended to be a blocker issue for the release of the state data source reader.






[jira] [Resolved] (SPARK-45687) Fix `Passing an explicit array value to a Scala varargs method is deprecated`

2023-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-45687.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43642
[https://github.com/apache/spark/pull/43642]

> Fix `Passing an explicit array value to a Scala varargs method is deprecated`
> -
>
> Key: SPARK-45687
> URL: https://issues.apache.org/jira/browse/SPARK-45687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Tengfei Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
>  
> {code:java}
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala:945:21:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.AggregationQuerySuite, version=2.13.0
> [warn]         df.agg(udaf(allColumns: _*)),
> [warn]                     ^
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:156:48:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.ObjectHashAggregateSuite, 
> version=2.13.0
> [warn]         df.agg(aggFunctions.head, aggFunctions.tail: _*),
> [warn]                                                ^
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:161:76:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.ObjectHashAggregateSuite, 
> version=2.13.0
> [warn]         df.groupBy($"id" % 4 as "mod").agg(aggFunctions.head, 
> aggFunctions.tail: _*),
> [warn]                                                                        
>     ^
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:171:50:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.ObjectHashAggregateSuite, 
> version=2.13.0
> [warn]           df.agg(aggFunctions.head, aggFunctions.tail: _*),
> [warn]  {code}
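> A minimal illustration of the suggested fix for the varargs case (the method and 
> values below are made up, not the actual test code):
> {code:java}
> import scala.collection.immutable.ArraySeq
> 
> def total(xs: Int*): Int = xs.sum
> val values = Array(1, 2, 3)
> 
> // Deprecated since 2.13: total(values: _*) triggers a defensive copy.
> val viaWrap = total(ArraySeq.unsafeWrapArray(values): _*)  // non-copying wrapper
> val viaSeq  = total(values.toIndexedSeq: _*)               // explicit copy
> {code}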






[jira] [Assigned] (SPARK-45687) Fix `Passing an explicit array value to a Scala varargs method is deprecated`

2023-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-45687:


Assignee: Tengfei Huang

> Fix `Passing an explicit array value to a Scala varargs method is deprecated`
> -
>
> Key: SPARK-45687
> URL: https://issues.apache.org/jira/browse/SPARK-45687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Tengfei Huang
>Priority: Major
>  Labels: pull-request-available
>
> Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
>  
> {code:java}
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala:945:21:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.AggregationQuerySuite, version=2.13.0
> [warn]         df.agg(udaf(allColumns: _*)),
> [warn]                     ^
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:156:48:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.ObjectHashAggregateSuite, 
> version=2.13.0
> [warn]         df.agg(aggFunctions.head, aggFunctions.tail: _*),
> [warn]                                                ^
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:161:76:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.ObjectHashAggregateSuite, 
> version=2.13.0
> [warn]         df.groupBy($"id" % 4 as "mod").agg(aggFunctions.head, 
> aggFunctions.tail: _*),
> [warn]                                                                        
>     ^
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ObjectHashAggregateSuite.scala:171:50:
>  Passing an explicit array value to a Scala varargs method is deprecated 
> (since 2.13.0) and will result in a defensive copy; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call
> [warn] Applicable -Wconf / @nowarn filters for this warning: msg=<part of the message>, cat=deprecation, 
> site=org.apache.spark.sql.hive.execution.ObjectHashAggregateSuite, 
> version=2.13.0
> [warn]           df.agg(aggFunctions.head, aggFunctions.tail: _*),
> [warn]  {code}






[jira] [Resolved] (SPARK-45878) ConcurrentModificationException in CliSuite

2023-11-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-45878.
--
Fix Version/s: 3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 43749
[https://github.com/apache/spark/pull/43749]

> ConcurrentModificationException in CliSuite
> ---
>
> Key: SPARK-45878
> URL: https://issues.apache.org/jira/browse/SPARK-45878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0, 3.4.2
>
>
> {code:java}
> // code placeholder
> java.util.ConcurrentModificationException: mutation occurred during iteration
> [info]   at 
> scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
> [info]   at 
> scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
> [info]   at 
> scala.collection.IterableOnceOps.addString(IterableOnce.scala:1247)
> [info]   at 
> scala.collection.IterableOnceOps.addString$(IterableOnce.scala:1241)
> [info]   at scala.collection.AbstractIterable.addString(Iterable.scala:933)
> [info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1191)
> [info]   at 
> scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1189)
> [info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
> [info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1204)
> [info]   at 
> scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1204)
> [info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
> [info]   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.runCliWithin(CliSuite.scala:205)
> [info]   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$new$20(CliSuite.scala:501)
>  {code}
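> The failure mode can be illustrated with a small snippet (an assumption about the 
> root cause, not the CliSuite code itself): Scala 2.13 collections track mutation, 
> so iterating a mutable buffer after it has been appended to in the meantime 
> throws this exception.
> {code:java}
> import scala.collection.mutable.ArrayBuffer
> 
> val log = ArrayBuffer("line 1", "line 2")
> val it  = log.iterator
> log += "line 3"      // buffer mutated after the iterator was created
> it.mkString("\n")    // java.util.ConcurrentModificationException:
>                      //   mutation occurred during iteration
> {code}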






[jira] [Assigned] (SPARK-45878) ConcurrentModificationException in CliSuite

2023-11-10 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-45878:


Assignee: Kent Yao

> ConcurrentModificationException in CliSuite
> ---
>
> Key: SPARK-45878
> URL: https://issues.apache.org/jira/browse/SPARK-45878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> // code placeholder
> java.util.ConcurrentModificationException: mutation occurred during iteration
> [info]   at 
> scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
> [info]   at 
> scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
> [info]   at 
> scala.collection.IterableOnceOps.addString(IterableOnce.scala:1247)
> [info]   at 
> scala.collection.IterableOnceOps.addString$(IterableOnce.scala:1241)
> [info]   at scala.collection.AbstractIterable.addString(Iterable.scala:933)
> [info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1191)
> [info]   at 
> scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1189)
> [info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
> [info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1204)
> [info]   at 
> scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1204)
> [info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
> [info]   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.runCliWithin(CliSuite.scala:205)
> [info]   at 
> org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$new$20(CliSuite.scala:501)
>  {code}






[jira] [Updated] (SPARK-45886) Output full stack trace in callSite of DataFrame context

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45886:
---
Labels: pull-request-available  (was: )

> Output full stack trace in callSite of DataFrame context
> 
>
> Key: SPARK-45886
> URL: https://issues.apache.org/jira/browse/SPARK-45886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
>
> Include all available items in the callSite of the DataFrame context. This will 
> simplify troubleshooting user issues.






[jira] [Created] (SPARK-45886) Output full stack trace in callSite of DataFrame context

2023-11-10 Thread Max Gekk (Jira)
Max Gekk created SPARK-45886:


 Summary: Output full stack trace in callSite of DataFrame context
 Key: SPARK-45886
 URL: https://issues.apache.org/jira/browse/SPARK-45886
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk


Include all available items in the callSite of the DataFrame context. This will 
simplify troubleshooting user issues.






[jira] [Updated] (SPARK-45851) (Scala) Support different retry policies for connect client

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45851:
---
Labels: pull-request-available  (was: )

> (Scala) Support different retry policies for connect client
> ---
>
> Key: SPARK-45851
> URL: https://issues.apache.org/jira/browse/SPARK-45851
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Alice Sayutina
>Priority: Major
>  Labels: pull-request-available
>
> Support multiple retry policies defined at the same time. Each policy 
> determines which error types it can retry and exactly how.
> For instance, networking errors should generally be retried differently than 
> errors caused by a remote resource not yet being available.
> Relevant Python ticket: SPARK-45733
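> A rough sketch of the idea (the names below are illustrative only, not the actual 
> connect client API):
> {code:java}
> trait RetryPolicy {
>   // Decides whether this policy is responsible for retrying the given error.
>   def canRetry(e: Throwable): Boolean
>   // How long to wait before the given attempt, in milliseconds.
>   def nextWaitMs(attempt: Int): Long
> }
> 
> object NetworkErrorPolicy extends RetryPolicy {
>   def canRetry(e: Throwable): Boolean = e.isInstanceOf[java.io.IOException]
>   def nextWaitMs(attempt: Int): Long  = math.min(1000L * (1L << attempt), 30000L)
> }
> {code}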






[jira] [Updated] (SPARK-45885) Upgrade ORC to 1.7.10

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45885:
---
Labels: pull-request-available  (was: )

> Upgrade ORC to 1.7.10
> -
>
> Key: SPARK-45885
> URL: https://issues.apache.org/jira/browse/SPARK-45885
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.3
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45884) Upgrade ORC to 1.8.6

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45884:
---
Labels: pull-request-available  (was: )

> Upgrade ORC to 1.8.6
> 
>
> Key: SPARK-45884
> URL: https://issues.apache.org/jira/browse/SPARK-45884
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.2
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45884) Upgrade ORC to 1.8.6

2023-11-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45884:
-

 Summary: Upgrade ORC to 1.8.6
 Key: SPARK-45884
 URL: https://issues.apache.org/jira/browse/SPARK-45884
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.4.2
Reporter: Dongjoon Hyun









[jira] [Created] (SPARK-45883) Upgrade ORC to 1.9.2

2023-11-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45883:
-

 Summary: Upgrade ORC to 1.9.2
 Key: SPARK-45883
 URL: https://issues.apache.org/jira/browse/SPARK-45883
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.5.1
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45882:
---
Labels: pull-request-available  (was: )

> BroadcastHashJoinExec propagate partitioning should respect 
> CoalescedHashPartitioning
> -
>
> Key: SPARK-45882
> URL: https://issues.apache.org/jira/browse/SPARK-45882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45881) Use Higher Order aggregate functions from SQL

2023-11-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45881:
---
Labels: pull-request-available  (was: )

> Use Higher Order aggregate functions from SQL
> -
>
> Key: SPARK-45881
> URL: https://issues.apache.org/jira/browse/SPARK-45881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Steven Aerts
>Priority: Major
>  Labels: pull-request-available
>
> Higher-order aggregate functions are aggregation functions which take a lambda 
> function as a parameter.
> An example of this from Presto is the function 
> {{[reduce_agg|https://prestodb.io/docs/current/functions/aggregate.html#reduce_agg]}}
>  which has the signature {{reduce_agg(inputValue T, initialState S, 
> inputFunction(S, T, S), combineFunction(S, S, S))}} and works like this:
> {code:java}
> SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> a + b)
> FROM (VALUES (1, 2), (1, 3), (1, 4), (2, 20), (2, 30), (2, 40)) AS t(id, 
> value)
> GROUP BY id;
> -- (1, 9)
> -- (2, 90)
> {code}
> In Spark today you can define, implement and use such a custom function from 
> the Scala API by implementing a case class which extends 
> {{TypedImperativeAggregate}} and adds the {{HigherOrderFunction}} trait.
> However, if you try to use this function from the SQL API, you get:
> {code:java}
> org.apache.spark.sql.AnalysisException: A lambda function should only be used 
> in a higher order function. However, its class is 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, 
> which is not a higher order function.; line 2 pos 2
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.$anonfun$applyOrElse$155(Analyzer.scala:2142)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.$anonfun$applyOrElse$154(Analyzer.scala:2135)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:100)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.applyOrElse(Analyzer.scala:2143)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.applyOrElse(Analyzer.scala:2132)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
> {code}
> There is just a small thing missing in the {{Analyzer}} to get all of this 
> working; we will provide a fix, unblocking higher-order aggregate functions 
> in Spark SQL.






[jira] [Created] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning

2023-11-10 Thread XiDuo You (Jira)
XiDuo You created SPARK-45882:
-

 Summary: BroadcastHashJoinExec propagate partitioning should 
respect CoalescedHashPartitioning
 Key: SPARK-45882
 URL: https://issues.apache.org/jira/browse/SPARK-45882
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You









[jira] [Created] (SPARK-45881) Use Higher Order aggregate functions from SQL

2023-11-10 Thread Steven Aerts (Jira)
Steven Aerts created SPARK-45881:


 Summary: Use Higher Order aggregate functions from SQL
 Key: SPARK-45881
 URL: https://issues.apache.org/jira/browse/SPARK-45881
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Steven Aerts


Higher-order aggregate functions are aggregation functions which take a lambda 
function as a parameter.

An example of this from Presto is the function 
{{[reduce_agg|https://prestodb.io/docs/current/functions/aggregate.html#reduce_agg]}}
 which has the signature {{reduce_agg(inputValue T, initialState S, 
inputFunction(S, T, S), combineFunction(S, S, S))}} and works like this:
{code:java}
SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> a + b)
FROM (VALUES (1, 2), (1, 3), (1, 4), (2, 20), (2, 30), (2, 40)) AS t(id, value)
GROUP BY id;
-- (1, 9)
-- (2, 90)
{code}
In Spark today you can define, implement and use such a custom function from 
the Scala API by implementing a case class which extends 
{{TypedImperativeAggregate}} and adds the {{HigherOrderFunction}} trait.

However, if you try to use this function from the SQL API, you get:
{code:java}
org.apache.spark.sql.AnalysisException: A lambda function should only be used 
in a higher order function. However, its class is 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, which 
is not a higher order function.; line 2 pos 2
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.$anonfun$applyOrElse$155(Analyzer.scala:2142)
  at scala.Option.map(Option.scala:230)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.$anonfun$applyOrElse$154(Analyzer.scala:2135)
  at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:100)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.applyOrElse(Analyzer.scala:2143)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$22$$anonfun$applyOrElse$153.applyOrElse(Analyzer.scala:2132)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
{code}
There is just a small thing missing in the {{Analyzer}} to get all of this 
working; we will provide a fix, unblocking higher-order aggregate functions in 
Spark SQL.
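For contrast, a lambda already works today when the higher-order function is not 
an aggregate. The same result as the reduce_agg example can be obtained by first 
collecting the values and then folding the array with the existing {{aggregate}} 
function (a sketch, assuming a SparkSession named spark):
{code:java}
spark.sql("""
  SELECT id, aggregate(collect_list(value), 0, (a, b) -> a + b) AS total
  FROM VALUES (1, 2), (1, 3), (1, 4), (2, 20), (2, 30), (2, 40) AS t(id, value)
  GROUP BY id
""").show()
// total is 9 for id 1 and 90 for id 2
{code}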






[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Description: 
When doing a join with the "input_file_name()" function, it will blow up with an 
*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, and the join passes.

Is this number check for InputFileBlockSources missing for the V2 data source, 
or is it by design?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
+---+---+---+---+
|id |const1 |const2 |input_file_name()                                          
                            |
+---+---+---+---+
|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+---+---+---+
only showing top 5 rows{code}
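A quick way to double-check which read path a given build takes is to look at the 
physical plan before calling any action; this untested sketch just reuses the 
repro tables above:
{code:java}
// Sketch only: inspect the plan to see which source the reader resolved to.
// With the V1 path the selectExpr call below already fails with the
// AnalysisException shown above; with the V2 path explain() prints a BatchScan
// node and the query runs.
val joined = spark.read.parquet("/data/tmp/t1")
  .join(spark.read.parquet("/data/tmp/t2"), Seq("id"), "inner")
  .selectExpr("*", "input_file_name()")

joined.explain()
{code}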

  was:
When doing a join with the "input_file_name()" function, it will blow up with a

*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

 

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, and the join passed.

Is this number check for InputFileDataSources mssing for V2 data source ? or is 
it by design ?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 

[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Description: 
When doing a join with the "input_file_name()" function, it will blow up with an 
*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, and the join passed.

Is this number check for InputFileBlockSources missing for the V2 data source, 
or is it by design?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
+---+---+---+---+
|id |const1 |const2 |input_file_name()                                          
                            |
+---+---+---+---+
|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+---+---+---+
only showing top 5 rows{code}

  was:
When doing a join with the "input_file_name()" function, it will blow up with a

*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

 

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passed.

Is this number check for InputFileDataSources mssing for V2 data source ? or it 
is by design ?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 

[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Environment: I tried on Spark 323 and Spark 341, and both can reproduce 
this issue.  (was: I tried on Spark 323 and Spark 341, and both reproduced this 
issue.)

> Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?
> -
>
> Key: SPARK-45879
> URL: https://issues.apache.org/jira/browse/SPARK-45879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: I tried on Spark 323 and Spark 341, and both can 
> reproduce this issue.
>Reporter: Liangcai li
>Priority: Major
>
> When doing a join with the "input_file_name()" function, it will blow up with 
> an *AnalysisException* if using the v1 data source (FileSourceScan). That's ok.
>  
> But if we change to use the v2 data source (BatchScan), the expected 
> exception is gone, the join passed.
> Is this number check for InputFileBlockSources missing for the V2 data source, 
> or is it by design?
>  
> Repro steps:
> {code:java}
> scala> spark.range(100).withColumn("const1", 
> lit("from_t1")).write.parquet("/data/tmp/t1")
>  
> scala> spark.range(100).withColumn("const2", 
> lit("from_t2")).write.parquet("/data/tmp/t2")
>  
> scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
>  
> scala> 
> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
> "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
> org.apache.spark.sql.AnalysisException: 'input_file_name' does not support 
> more than one sources.; line 1 pos 0;
> Project id#376L, const1#377, const2#381, input_file_name() AS 
> input_file_name()#389
> +- Project id#376L, const1#377, const2#381
>    +- Join Inner, (id#376L = id#380L)
>       :- Relation id#376L,const1#377 parquet
>       +- Relation id#380L,const2#381 parquet
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
>   at 
> org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
>   at 
> org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
>   at 
> org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
>  
> scala> 
> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
> "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
> +---+---+---+---+
> |id |const1 |const2 |input_file_name()                                        
>                               |
> +---+---+---+---+
> |91 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |92 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |93 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |94 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |95 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> +---+---+---+---+
> only showing top 5 rows{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Environment: I tried on Spark 323 and Spark 341, and both reproduced this 
issue.  (was: I tried on Spark 323 and Spark 341, both reproduced this issue.)

> Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?
> -
>
> Key: SPARK-45879
> URL: https://issues.apache.org/jira/browse/SPARK-45879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: I tried on Spark 323 and Spark 341, and both reproduced 
> this issue.
>Reporter: Liangcai li
>Priority: Major
>
> When doing a join with the "input_file_name()" function, it will blow up with 
> an *AnalysisException* if using the v1 data source (FileSourceScan). That's ok.
>  
> But if we change to use the v2 data source (BatchScan), the expected 
> exception is gone, the join passed.
> Is this number check for InputFileBlockSources missing for the V2 data source, 
> or is it by design?
>  
> Repro steps:
> {code:java}
> scala> spark.range(100).withColumn("const1", 
> lit("from_t1")).write.parquet("/data/tmp/t1")
>  
> scala> spark.range(100).withColumn("const2", 
> lit("from_t2")).write.parquet("/data/tmp/t2")
>  
> scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
>  
> scala> 
> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
> "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
> org.apache.spark.sql.AnalysisException: 'input_file_name' does not support 
> more than one sources.; line 1 pos 0;
> Project id#376L, const1#377, const2#381, input_file_name() AS 
> input_file_name()#389
> +- Project id#376L, const1#377, const2#381
>    +- Join Inner, (id#376L = id#380L)
>       :- Relation id#376L,const1#377 parquet
>       +- Relation id#380L,const2#381 parquet
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
>   at 
> org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
>   at 
> org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
>   at 
> org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
>  
> scala> 
> spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
> "id", "inner").selectExpr("*", "input_file_name()").show(5, false)
> +---+---+---+---+
> |id |const1 |const2 |input_file_name()                                        
>                               |
> +---+---+---+---+
> |91 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |92 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |93 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |94 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> |95 
> |from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
> +---+---+---+---+
> only showing top 5 rows{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Description: 
When doing a join with the "input_file_name()" function, it will blow up with an 
*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passed.

Is this number check for InputFileBlockSources missing for the V2 data source, 
or is it by design?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
+---+---+---+---+
|id |const1 |const2 |input_file_name()                                          
                            |
+---+---+---+---+
|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+---+---+---+
only showing top 5 rows{code}

  was:
When doing a join with the "input_file_name()" function, it will blow up with a

*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

 

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passed.

Is this number check for InputFileDataSources mssing for V2 data source?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 

[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Description: 
When doing a join with the "input_file_name()" function, it will blow up with an 
*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passed.

Is this number check for InputFileBlockSources missing for the V2 data source?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
+---+---+---+---+
|id |const1 |const2 |input_file_name()                                          
                            |
+---+---+---+---+
|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+---+---+---+
only showing top 5 rows{code}

  was:
When doing a join with the "input_file_name()" function, it will blow up with a

*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

 

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

Is this number check for InputFileDataSources mssing for V2 data source?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 

[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Description: 
When doing a join with the "input_file_name()" function, it will blow up with an 
*AnalysisException* if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

Is this number check for InputFileBlockSources missing for the V2 data source?

 

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
+---+---+---+---+
|id |const1 |const2 |input_file_name()                                          
                            |
+---+---+---+---+
|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+---+---+---+
only showing top 5 rows{code}

  was:
When doing a join with the "input_file_name()" function, it will blow up with a

AnalysisException if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

Is this number check for InputFileDataSources mssing for V2 data source?

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 

[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Description: 
When doing a join with the "input_file_name()" function, it will blow up with an 
AnalysisException if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

Is this number check for InputFileBlockSources missing for the V2 data source?

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
+---+---+---+---+
|id |const1 |const2 |input_file_name()                                          
                            |
+---+---+---+---+
|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+---+---+---+
only showing top 5 rows{code}

  was:
When doing a join with the "input_file_name()" function, it will blow up with a

AnalysisException if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

Is this number check for InputFileDataSources mssing for V2 data source?

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 

[jira] [Updated] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangcai li updated SPARK-45879:

Description: 
When doing a join with the "input_file_name()" function, it will blow up with an 
AnalysisException if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

Is this number check for InputFileBlockSources missing for the V2 data source?

Repro steps:
{code:java}
scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")
 
scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")
 
scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)
org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;
Project id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389
+- Project id#376L, const1#377, const2#381
   +- Join Inner, (id#376L = id#380L)
      :- Relation id#376L,const1#377 parquet
      +- Relation id#380L,const2#381 parquet
 
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)
  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")
 
scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false){code}
+---+---+---+---+
|id |const1 |const2 |input_file_name()  
|
+---+---+---+---+
|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|
+---+---+---+---+
only showing top 5 rows

  was:
When doing a join with the "input_file_name()" function, it will blow up with a

AnalysisException if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

 Is this number check for InputFileDataSources mssing for V2 data source?

Repro steps:

```

scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")

 

scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")

 

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")

 

scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)

org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;

Project [id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389]

+- Project [id#376L, const1#377, const2#381]

   +- Join Inner, (id#376L = id#380L)

      :- Relation [id#376L,const1#377] parquet

      +- Relation [id#380L,const2#381] parquet

 

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)

  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)

  at 

[jira] [Created] (SPARK-45879) Number check for InputFileBlockSources is missing for V2 source (BatchScan) ?

2023-11-10 Thread Liangcai li (Jira)
Liangcai li created SPARK-45879:
---

 Summary: Number check for InputFileBlockSources is missing for V2 
source (BatchScan) ?
 Key: SPARK-45879
 URL: https://issues.apache.org/jira/browse/SPARK-45879
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
 Environment: I tried on Spark 323 and Spark 341, both reproduced this 
issue.
Reporter: Liangcai li


When doing a join with the "input_file_name()" function, it will blow up with an 
AnalysisException if using the v1 data source (FileSourceScan). That's ok.

But if we change to use the v2 data source (BatchScan), the expected exception 
is gone, the join passes.

Is this number check for InputFileBlockSources missing for the V2 data source?

Repro steps:

```

scala> spark.range(100).withColumn("const1", 
lit("from_t1")).write.parquet("/data/tmp/t1")

 

scala> spark.range(100).withColumn("const2", 
lit("from_t2")).write.parquet("/data/tmp/t2")

 

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")

 

scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)

org.apache.spark.sql.AnalysisException: 'input_file_name' does not support more 
than one sources.; line 1 pos 0;

Project [id#376L, const1#377, const2#381, input_file_name() AS 
input_file_name()#389]

+- Project [id#376L, const1#377, const2#381]

   +- Join Inner, (id#376L = id#380L)

      :- Relation [id#376L,const1#377] parquet

      +- Relation [id#380L,const2#381] parquet

 

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)

  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.org$apache$spark$sql$execution$datasources$PreReadCheck$$checkNumInputFileBlockSources(rules.scala:476)

  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2(rules.scala:472)

  at 
org.apache.spark.sql.execution.datasources.PreReadCheck$.$anonfun$checkNumInputFileBlockSources$2$adapted(rules.scala:472)

  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)

  at scala.collection.Iterator.foreach(Iterator.scala:943)

  at scala.collection.Iterator.foreach$(Iterator.scala:943)

  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

  at scala.collection.IterableLike.foreach(IterableLike.scala:74)

  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

```

```

scala> spark.conf.set("spark.sql.sources.useV1SourceList", "")

 

scala> 
spark.read.parquet("/data/tmp/t1").join(spark.read.parquet("/data/tmp/t2"), 
"id", "inner").selectExpr("*", "input_file_name()").show(5, false)

+---+---+---+---+

|id |const1 |const2 |input_file_name()                                          
                            |

+---+---+---+---+

|91 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|

|92 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|

|93 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|

|94 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|

|95 
|from_t1|from_t2|file:///data/tmp/t1/part-00011-a52b9990-4463-447c-9cdf-7a84542de2f7-c000.snappy.parquet|

+---+---+---+---+

only showing top 5 rows

```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45837) Report underlying error in scala client

2023-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45837:


Assignee: Alice Sayutina

> Report underlying error in scala client
> ---
>
> Key: SPARK-45837
> URL: https://issues.apache.org/jira/browse/SPARK-45837
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Alice Sayutina
>Assignee: Alice Sayutina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When there is a retry-worthy error, we need to not just throw RetryException, 
> but also 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45837) Report underlying error in scala client

2023-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45837.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43719
[https://github.com/apache/spark/pull/43719]

> Report underlying error in scala client
> ---
>
> Key: SPARK-45837
> URL: https://issues.apache.org/jira/browse/SPARK-45837
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Alice Sayutina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When there is a retry-worthy error, we need to not just throw RetryException, 
> but also 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45852) Gracefully deal with recursion exception during Spark Connect logging

2023-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45852.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43732
[https://github.com/apache/spark/pull/43732]

> Gracefully deal with recursion exception during Spark Connect logging
> -
>
> Key: SPARK-45852
> URL: https://issues.apache.org/jira/browse/SPARK-45852
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> ```
> from google.protobuf.text_format import MessageToString
> from pyspark.sql.functions import col, lit
> df = spark.range(10)
> for x in range(800):
>   df = df.withColumn(f"next{x}", lit(1))
>   MessageToString(df._plan.to_proto(spark._client), as_one_line=True)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org