[jira] [Updated] (SPARK-41360) Avoid BlockManager re-registration if the executor has been lost

2022-12-01 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-41360:
-
Summary: Avoid BlockManager re-registration if the executor has been lost  
(was: Avoid BlockMananger re-registration if the executor has been lost)

> Avoid BlockManager re-registration if the executor has been lost
> 
>
> Key: SPARK-41360
> URL: https://issues.apache.org/jira/browse/SPARK-41360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: wuyi
>Priority: Major
>
> We should avoid block manager re-registration if the executor has already been 
> lost, as it is meaningless and potentially harmful; see, e.g., SPARK-35011.






[jira] [Commented] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-12-01 Thread Ritika Maheshwari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642332#comment-17642332
 ] 

Ritika Maheshwari commented on SPARK-41236:
---

Running the following query against the downloaded Spark 3.3.0 code shows that 
the error message has improved.

spark-sql> select collect_set(age) as age
         > from test2.ageGroups
         > GROUP BY name
         > having size(age) > 1;
Error in query: cannot resolve 'size(age)' due to data type mismatch: argument 
1 requires (array or map) type, however, 'spark_catalog.test2.agegroups.age' is 
of int type.; line 4 pos 7;
'Filter (size('age, true) > 1)
+- Aggregate [name#29], [collect_set(age#30, 0, 0) AS age#27]
   +- SubqueryAlias spark_catalog.test2.agegroups
      +- HiveTableRelation [`test2`.`agegroups`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [eid#28, 
name#29, age#30], Partition Cols: []]

But this is confusing: if the analyzer recognizes age as int there, the 
following query should not have failed either. Instead it fails complaining 
that age is an array, because the reference is getting bound to the renamed 
column.

spark-sql> select collect_set(age) as age
         > from test2.ageGroups
         > GROUP BY name
         > Having age > 1;
Error in query: cannot resolve '(age > 1)' due to data type mismatch: differing 
types in '(age > 1)' (array and int).; line 4 pos 7;
'Filter (age#62 > 1)
+- Aggregate [name#64], [collect_set(age#65, 0, 0) AS age#62]
   +- SubqueryAlias spark_catalog.test2.agegroups
      +- HiveTableRelation [`test2`.`agegroups`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [eid#63, 
name#64, age#65], Partition Cols: []]
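
For reference, a minimal workaround sketch for the binding ambiguity above 
(assuming a spark-shell session where {{spark}} is in scope and the 
illustrative test2.ageGroups table exists): aliasing the aggregate to a name 
that differs from the source column leaves HAVING only one thing to bind to.

{code:java}
// Hypothetical sketch: the alias ("ages") differs from the source column
// ("age"), so size(ages) can only resolve to the collect_set result.
val grouped = spark.sql(
  """SELECT collect_set(age) AS ages
    |FROM test2.ageGroups
    |GROUP BY name
    |HAVING size(ages) > 1""".stripMargin)
grouped.show()
{code}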


> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> {code:java}
> select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 
> {code}
> a simple SQL query; it works well in Spark 2.4 but does not work in Spark 3.2.0.
> Is this a bug or a new standard?
> h3. *like this:*
> {code:sql}
> create table db1.table1(age int, name string);
> insert into db1.table1 values(1, 'a');
> insert into db1.table1 values(2, 'b');
> insert into db1.table1 values(3, 'c');
> --then run sql like this 
> select collect_set(age) as age from db1.table1 group by name having size(age) 
> > 1 ;
> {code}
> h3. Stack Information
> org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
> columns: [age]; line 4 pos 12;
> 'Filter (size('age, true) > 1)
> +- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
>+- SubqueryAlias spark_catalog.db1.table1
>   +- HiveTableRelation [`db1`.`table1`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
> Partition Cols: []]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
>   

[jira] [Created] (SPARK-41360) Avoid BlockMananger re-registration if the executor has been lost

2022-12-01 Thread wuyi (Jira)
wuyi created SPARK-41360:


 Summary: Avoid BlockMananger re-registration if the executor has 
been lost
 Key: SPARK-41360
 URL: https://issues.apache.org/jira/browse/SPARK-41360
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.1, 3.2.2, 3.1.3, 3.0.3, 2.4.8
Reporter: wuyi


We should avoid block manager re-registration if the executor has already been 
lost, as it is meaningless and potentially harmful; see, e.g., SPARK-35011.






[jira] [Commented] (SPARK-41358) Use `PhysicalDataType` instead of DataType in ColumnVectorUtils

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642313#comment-17642313
 ] 

Apache Spark commented on SPARK-41358:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38873

> Use `PhysicalDataType` instead of DataType in ColumnVectorUtils
> ---
>
> Key: SPARK-41358
> URL: https://issues.apache.org/jira/browse/SPARK-41358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Assigned] (SPARK-41358) Use `PhysicalDataType` instead of DataType in ColumnVectorUtils

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41358:


Assignee: (was: Apache Spark)

> Use `PhysicalDataType` instead of DataType in ColumnVectorUtils
> ---
>
> Key: SPARK-41358
> URL: https://issues.apache.org/jira/browse/SPARK-41358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Assigned] (SPARK-41358) Use `PhysicalDataType` instead of DataType in ColumnVectorUtils

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41358:


Assignee: Apache Spark

> Use `PhysicalDataType` instead of DataType in ColumnVectorUtils
> ---
>
> Key: SPARK-41358
> URL: https://issues.apache.org/jira/browse/SPARK-41358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-41358) Use `PhysicalDataType` instead of DataType in ColumnVectorUtils

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642312#comment-17642312
 ] 

Apache Spark commented on SPARK-41358:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38873

> Use `PhysicalDataType` instead of DataType in ColumnVectorUtils
> ---
>
> Key: SPARK-41358
> URL: https://issues.apache.org/jira/browse/SPARK-41358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Created] (SPARK-41359) Use `PhysicalDataType` instead of DataType in UnsafeRow

2022-12-01 Thread Yang Jie (Jira)
Yang Jie created SPARK-41359:


 Summary: Use `PhysicalDataType` instead of DataType in UnsafeRow
 Key: SPARK-41359
 URL: https://issues.apache.org/jira/browse/SPARK-41359
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-41358) Use `PhysicalDataType` instead of DataType in ColumnVectorUtils

2022-12-01 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-41358:
-
Parent: SPARK-41356
Issue Type: Sub-task  (was: Improvement)

> Use `PhysicalDataType` instead of DataType in ColumnVectorUtils
> ---
>
> Key: SPARK-41358
> URL: https://issues.apache.org/jira/browse/SPARK-41358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Created] (SPARK-41358) Use `PhysicalDataType` instead of DataType in ColumnVectorUtils

2022-12-01 Thread Yang Jie (Jira)
Yang Jie created SPARK-41358:


 Summary: Use `PhysicalDataType` instead of DataType in 
ColumnVectorUtils
 Key: SPARK-41358
 URL: https://issues.apache.org/jira/browse/SPARK-41358
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-41356) Find and refactor all cases suitable for using `PhysicalDataType` instead of `DataType`

2022-12-01 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-41356:
-
Description: SPARK-41226 introduced physical data types and replaced DataType 
with PhysicalDataType in many places, but there are likely more refactorable 
cases in the code. We can add tasks at any time to complete the relevant 
refactoring.

> Find and refactor all cases suitable for using `PhysicalDataType` instead of 
> `DataType`
> ---
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> SPARK-41226 introduced physical data types and replaced DataType with 
> PhysicalDataType in many places, but there are likely more refactorable cases 
> in the code. We can add tasks at any time to complete the relevant 
> refactoring.






[jira] [Assigned] (SPARK-41357) Implement math functions

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41357:


Assignee: (was: Apache Spark)

> Implement math functions
> 
>
> Key: SPARK-41357
> URL: https://issues.apache.org/jira/browse/SPARK-41357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-41357) Implement math functions

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41357:


Assignee: Apache Spark

> Implement math functions
> 
>
> Key: SPARK-41357
> URL: https://issues.apache.org/jira/browse/SPARK-41357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-41357) Implement math functions

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642310#comment-17642310
 ] 

Apache Spark commented on SPARK-41357:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38872

> Implement math functions
> 
>
> Key: SPARK-41357
> URL: https://issues.apache.org/jira/browse/SPARK-41357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-41356) Find and refactor all cases suitable for using `PhysicalDataType` instead of `DataType`

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642309#comment-17642309
 ] 

Apache Spark commented on SPARK-41356:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38869

> Find and refactor all cases suitable for using `PhysicalDataType` instead of 
> `DataType`
> ---
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Updated] (SPARK-41356) Find and refactor all cases suitable for using `PhysicalDataType` instead of `DataType`

2022-12-01 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-41356:
-
Issue Type: Umbrella  (was: Improvement)

> Find and refactor all cases suitable for using `PhysicalDataType` instead of 
> `DataType`
> ---
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Created] (SPARK-41357) Implement math functions

2022-12-01 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41357:
-

 Summary: Implement math functions
 Key: SPARK-41357
 URL: https://issues.apache.org/jira/browse/SPARK-41357
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng









[jira] [Updated] (SPARK-35011) False active executor in UI caused by BlockManager reregistration

2022-12-01 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-35011:
-
Summary: False active executor in UI caused by BlockManager reregistration  
(was: Avoid Block Manager registerations when StopExecutor msg is in-flight.)

> False active executor in UI caused by BlockManager reregistration
> --
>
> Key: SPARK-35011
> URL: https://issues.apache.org/jira/browse/SPARK-35011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Sumeet
>Assignee: wuyi
>Priority: Major
>  Labels: BlockManager, core
> Fix For: 3.3.0
>
>
> *Note:* This is a follow-up on SPARK-34949; even after the heartbeat fix, the 
> driver reports dead executors as alive.
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. When the 
> executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s; however, under the "Executors" tab in the Spark 
> UI, I could still see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues async "StopExecutor" on 
> executorEndpoint
>  * "CoarseGrainedSchedulerBackend" removes that executor from Driver's 
> internal data structures and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus".
>  * Executor has still not processed "StopExecutor" from the Driver
>  * Driver receives a heartbeat from the Executor; since it cannot find the 
> "executorId" in its data structures, it responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" 
> and "SparkListenerBlockManagerAdded" is published on the "listenerBus"
>  * Executor starts processing the "StopExecutor" and exits
>  * "AppStatusListener" picks the "SparkListenerBlockManagerAdded" event and 
> updates "AppStatusStore"
>  * "statusTracker.getExecutorInfos" refers "AppStatusStore" to get the list 
> of executors which returns the dead executor as alive.
>  
> *Proposed Solution:*
> Maintain a Cache of recently removed executors on Driver. During the 
> registration in BlockManagerMasterEndpoint if the BlockManager belongs to a 
> recently removed executor, return None indicating the registration is ignored 
> since the executor will be shutting down soon.
> On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
> executor, return true indicating the driver knows about it, thereby 
> preventing re-registration.
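
To make the proposed solution concrete, here is a minimal, self-contained 
sketch (hypothetical names, not Spark's actual code) of the "recently removed 
executors" cache described above:

{code:java}
import java.util.concurrent.ConcurrentHashMap

object RemovedExecutorCacheSketch {
  // Executor IDs the driver removed recently; a real implementation
  // would bound this cache by size and/or time.
  private val recentlyRemoved = ConcurrentHashMap.newKeySet[String]()

  def markRemoved(executorId: String): Unit = recentlyRemoved.add(executorId)

  // Registration path: return None to signal the registration is
  // ignored because the executor will be shutting down soon.
  def registerBlockManager(executorId: String): Option[String] =
    if (recentlyRemoved.contains(executorId)) None else Some(executorId)

  // Heartbeat path: answer true ("driver knows this block manager")
  // so the executor does not try to re-register.
  def heartbeatKnown(executorId: String): Boolean =
    recentlyRemoved.contains(executorId)

  def main(args: Array[String]): Unit = {
    markRemoved("exec-1")
    assert(registerBlockManager("exec-1").isEmpty)  // ignored
    assert(heartbeatKnown("exec-1"))                // no re-registration
    assert(registerBlockManager("exec-2").nonEmpty) // normal path
  }
}
{code}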






[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642303#comment-17642303
 ] 

Apache Spark commented on SPARK-41344:
--

User 'wForget' has created a pull request for this issue:
https://github.com/apache/spark/pull/38871

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
>
> In Spark 3.3:
>  # In DataSourceV2Utils, loadV2Source calls (CatalogV2Util.loadTable(catalog, 
> ident, timeTravel).get, Some(catalog), Some(ident)).
>  # In CatalogV2Util.scala, when loadTable(x,x,x) fails with any of 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException, 
> it returns None.
>  # Coming back to DataSourceV2Utils, calling .get on that None results in a 
> cryptic error that is technically "correct", but the *original 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException is 
> thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3 
> the *original error* was shown; losing it seems like a design flaw.
>  
> *Sample user-facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and returning None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]
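
A minimal sketch (hypothetical names, not the actual Spark code) of the 
failure mode described above: swallowing the lookup exception into None turns 
a descriptive error into a bare None.get at the call site.

{code:java}
object MaskedErrorSketch {
  final case class NoSuchTableException(name: String)
      extends Exception(s"Table or view not found: $name")

  private def lookup(name: String): String = throw NoSuchTableException(name)

  // Mirrors the pattern above: the descriptive exception is caught
  // and discarded, leaving only an Option.
  def loadTable(name: String): Option[String] =
    try Some(lookup(name))
    catch { case _: NoSuchTableException => None }

  def main(args: Array[String]): Unit = {
    // Throws java.util.NoSuchElementException: None.get, with no hint
    // of which table was missing.
    println(loadTable("db.missing").get)
  }
}
{code}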






[jira] [Assigned] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41344:


Assignee: Apache Spark

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Assignee: Apache Spark
>Priority: Critical
>
> In Spark 3.3:
>  # In DataSourceV2Utils, loadV2Source calls (CatalogV2Util.loadTable(catalog, 
> ident, timeTravel).get, Some(catalog), Some(ident)).
>  # In CatalogV2Util.scala, when loadTable(x,x,x) fails with any of 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException, 
> it returns None.
>  # Coming back to DataSourceV2Utils, calling .get on that None results in a 
> cryptic error that is technically "correct", but the *original 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException is 
> thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3 
> the *original error* was shown; losing it seems like a design flaw.
>  
> *Sample user-facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and returning None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]






[jira] [Assigned] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41344:


Assignee: (was: Apache Spark)

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
>
> In Spark 3.3:
>  # In DataSourceV2Utils, loadV2Source calls (CatalogV2Util.loadTable(catalog, 
> ident, timeTravel).get, Some(catalog), Some(ident)).
>  # In CatalogV2Util.scala, when loadTable(x,x,x) fails with any of 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException, 
> it returns None.
>  # Coming back to DataSourceV2Utils, calling .get on that None results in a 
> cryptic error that is technically "correct", but the *original 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException is 
> thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3 
> the *original error* was shown; losing it seems like a design flaw.
>  
> *Sample user-facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and returning None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]






[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642304#comment-17642304
 ] 

Apache Spark commented on SPARK-41344:
--

User 'wForget' has created a pull request for this issue:
https://github.com/apache/spark/pull/38871

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
>
> In Spark 3.3:
>  # In DataSourceV2Utils, loadV2Source calls (CatalogV2Util.loadTable(catalog, 
> ident, timeTravel).get, Some(catalog), Some(ident)).
>  # In CatalogV2Util.scala, when loadTable(x,x,x) fails with any of 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException, 
> it returns None.
>  # Coming back to DataSourceV2Utils, calling .get on that None results in a 
> cryptic error that is technically "correct", but the *original 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException is 
> thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3 
> the *original error* was shown; losing it seems like a design flaw.
>  
> *Sample user-facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and returning None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]






[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-01 Thread Zhen Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642298#comment-17642298
 ] 

Zhen Wang commented on SPARK-41344:
---

I want to work on this and will send a PR later.

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
>
> In Spark 3.3:
>  # In DataSourceV2Utils, loadV2Source calls (CatalogV2Util.loadTable(catalog, 
> ident, timeTravel).get, Some(catalog), Some(ident)).
>  # In CatalogV2Util.scala, when loadTable(x,x,x) fails with any of 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException, 
> it returns None.
>  # Coming back to DataSourceV2Utils, calling .get on that None results in a 
> cryptic error that is technically "correct", but the *original 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException is 
> thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3 
> the *original error* was shown; losing it seems like a design flaw.
>  
> *Sample user-facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and returning None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]






[jira] [Assigned] (SPARK-41353) UNRESOLVED_ROUTINE error class

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41353:


Assignee: Apache Spark

> UNRESOLVED_ROUTINE error class
> --
>
> Key: SPARK-41353
> URL: https://issues.apache.org/jira/browse/SPARK-41353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Apache Spark
>Priority: Major
>
> We want to unify and name:
> "_LEGACY_ERROR_TEMP_1041" : {
>   "message" : [
>     "Undefined function ." ]
> },
> "_LEGACY_ERROR_TEMP_1242" : {
>   "message" : [
>     "Undefined function: . This function is neither a built-in/temporary 
> function, nor a persistent function that is qualified as ." ]
> },
> "_LEGACY_ERROR_TEMP_1243" : {
>   "message" : [
>     "Undefined function: " ]
> },
> My proposal is:
> UNRESOLVED_ROUTINE. routineName => `a`.`b`.`func`, routineSignature => [INT, 
> STRING], searchPath => [`builtin`, `session`, `hiveMetaStore`.`default`]
> This assumes agreement to introduce `builtin` as an optional qualifier for 
> builtin functions, and `session` as an optional qualifier for temporary 
> functions (separate PR).
> Q: Why ROUTINE?
> A: Some day we may want to support PROCEDURES; they will follow the same 
> naming rules and share the same namespace.
> Q: Why a path?
> A: We already follow a hard-coded path today, with a fixed precedence rule.
> Q: Why provide the signature?
> A: Long term, we may support overloading of functions by arity, type, or even 
> parameter name.
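
As a rough illustration only (the message template wording below is an 
assumption, not the proposal's final text), the unified error class might 
render like this:

{code:java}
// Hypothetical rendering of UNRESOLVED_ROUTINE; parameter names follow
// the proposal above, the message text is illustrative.
def unresolvedRoutine(routineName: String, searchPath: Seq[String]): String =
  s"[UNRESOLVED_ROUTINE] The routine $routineName cannot be resolved " +
    s"on search path ${searchPath.mkString("[", ", ", "]")}."

// unresolvedRoutine("`a`.`b`.`func`",
//   Seq("`builtin`", "`session`", "`hiveMetaStore`.`default`"))
// => [UNRESOLVED_ROUTINE] The routine `a`.`b`.`func` cannot be resolved
//    on search path [`builtin`, `session`, `hiveMetaStore`.`default`].
{code}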






[jira] [Assigned] (SPARK-41353) UNRESOLVED_ROUTINE error class

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41353:


Assignee: (was: Apache Spark)

> UNRESOLVED_ROUTINE error class
> --
>
> Key: SPARK-41353
> URL: https://issues.apache.org/jira/browse/SPARK-41353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> We want to unify and name:
> "_LEGACY_ERROR_TEMP_1041" : {
>   "message" : [
>     "Undefined function ." ]
> },
> "_LEGACY_ERROR_TEMP_1242" : {
>   "message" : [
>     "Undefined function: . This function is neither a built-in/temporary 
> function, nor a persistent function that is qualified as ." ]
> },
> "_LEGACY_ERROR_TEMP_1243" : {
>   "message" : [
>     "Undefined function: " ]
> },
> My proposal is:
> UNRESOLVED_ROUTINE. routineName => `a`.`b`.`func`, routineSignature => [INT, 
> STRING], searchPath => [`builtin`, `session`, `hiveMetaStore`.`default`]
> This assumes agreement to introduce `builtin` as an optional qualifier for 
> builtin functions, and `session` as an optional qualifier for temporary 
> functions (separate PR).
> Q: Why ROUTINE?
> A: Some day we may want to support PROCEDURES; they will follow the same 
> naming rules and share the same namespace.
> Q: Why a path?
> A: We already follow a hard-coded path today, with a fixed precedence rule.
> Q: Why provide the signature?
> A: Long term, we may support overloading of functions by arity, type, or even 
> parameter name.






[jira] [Commented] (SPARK-41353) UNRESOLVED_ROUTINE error class

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642286#comment-17642286
 ] 

Apache Spark commented on SPARK-41353:
--

User 'srielau' has created a pull request for this issue:
https://github.com/apache/spark/pull/38870

> UNRESOLVED_ROUTINE error class
> --
>
> Key: SPARK-41353
> URL: https://issues.apache.org/jira/browse/SPARK-41353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> We want to unify and name:
> "_LEGACY_ERROR_TEMP_1041" : {
>   "message" : [
>     "Undefined function ." ]
> },
> "_LEGACY_ERROR_TEMP_1242" : {
>   "message" : [
>     "Undefined function: . This function is neither a built-in/temporary 
> function, nor a persistent function that is qualified as ." ]
> },
> "_LEGACY_ERROR_TEMP_1243" : {
>   "message" : [
>     "Undefined function: " ]
> },
> My proposal is:
> UNRESOLVED_ROUTINE. routineName => `a`.`b`.`func`, routineSignature => [INT, 
> STRING], searchPath => [`builtin`, `session`, `hiveMetaStore`.`default`]
> This assumes agreement to introduce `builtin` as an optional qualifier for 
> builtin functions, and `session` as an optional qualifier for temporary 
> functions (separate PR).
> Q: Why ROUTINE?
> A: Some day we may want to support PROCEDURES; they will follow the same 
> naming rules and share the same namespace.
> Q: Why a path?
> A: We already follow a hard-coded path today, with a fixed precedence rule.
> Q: Why provide the signature?
> A: Long term, we may support overloading of functions by arity, type, or even 
> parameter name.






[jira] [Commented] (SPARK-41356) Find and refactor all cases suitable for using `PhysicalDataType` instead of `DataType`

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642270#comment-17642270
 ] 

Apache Spark commented on SPARK-41356:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38869

> Find and refactor all cases suitable for using `PhysicalDataType` instead of 
> `DataType`
> ---
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-12-01 Thread Ritika Maheshwari (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41236 ]


Ritika Maheshwari deleted comment on SPARK-41236:
---

was (Author: ritikam):
Hello Zhong,

Try renaming the field to a name different from the original column name:

select collect_set(age) as ageCol
from db_table.table1
group by name
having size(ageCol) > 1

Although your result will be zero rows, because you have only one age for each 
of your names "a", "b", and "c", size(ageCol) > 1 will filter everything out.

But if your table is

age  name
1    "a"
2    "a"
3    "a"
4    "b"
5    "c"
6    "c"

then you will get the result

[1,2,3]
[5,6]

> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> {code:java}
> select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 
> {code}
> a simple SQL query; it works well in Spark 2.4 but does not work in Spark 3.2.0.
> Is this a bug or a new standard?
> h3. *like this:*
> {code:sql}
> create table db1.table1(age int, name string);
> insert into db1.table1 values(1, 'a');
> insert into db1.table1 values(2, 'b');
> insert into db1.table1 values(3, 'c');
> --then run sql like this 
> select collect_set(age) as age from db1.table1 group by name having size(age) 
> > 1 ;
> {code}
> h3. Stack Information
> org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
> columns: [age]; line 4 pos 12;
> 'Filter (size('age, true) > 1)
> +- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
>+- SubqueryAlias spark_catalog.db1.table1
>   +- HiveTableRelation [`db1`.`table1`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
> Partition Cols: []]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
>   at 

[jira] [Updated] (SPARK-41356) Find and refactor all cases suitable for using `PhysicalDataType` instead of `DataType`

2022-12-01 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-41356:
-
Priority: Major  (was: Minor)

> Find and refactor all cases suitable for using `PhysicalDataType` instead of 
> `DataType`
> ---
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Updated] (SPARK-41356) Find and refactor all cases suitable for using `PhysicalDataType` instead of `DataType`

2022-12-01 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-41356:
-
Summary: Find and refactor all cases suitable for using `PhysicalDataType` 
instead of `DataType`  (was: Refactor `ColumnVectorUtils#populate` method to 
use physical types)

> Find and refactor all cases suitable for using `PhysicalDataType` instead of 
> `DataType`
> ---
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Commented] (SPARK-41356) Refactor `ColumnVectorUtils#populate` method to use physical types

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642254#comment-17642254
 ] 

Apache Spark commented on SPARK-41356:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38868

> Refactor `ColumnVectorUtils#populate` method to use physical types
> --
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Assigned] (SPARK-41356) Refactor `ColumnVectorUtils#populate` method to use physical types

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41356:


Assignee: (was: Apache Spark)

> Refactor `ColumnVectorUtils#populate` method to use physical types
> --
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Commented] (SPARK-41356) Refactor `ColumnVectorUtils#populate` method to use physical types

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642253#comment-17642253
 ] 

Apache Spark commented on SPARK-41356:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38868

> Refactor `ColumnVectorUtils#populate` method to use physical types
> --
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Assigned] (SPARK-41356) Refactor `ColumnVectorUtils#populate` method to use physical types

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41356:


Assignee: Apache Spark

> Refactor `ColumnVectorUtils#populate` method to use physical types
> --
>
> Key: SPARK-41356
> URL: https://issues.apache.org/jira/browse/SPARK-41356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-41356) Refactor `ColumnVectorUtils#populate` method to use physical types

2022-12-01 Thread Yang Jie (Jira)
Yang Jie created SPARK-41356:


 Summary: Refactor `ColumnVectorUtils#populate` method to use 
physical types
 Key: SPARK-41356
 URL: https://issues.apache.org/jira/browse/SPARK-41356
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yang Jie









[jira] [Commented] (SPARK-41349) Implement `DataFrame.hint`

2022-12-01 Thread Deng Ziming (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642238#comment-17642238
 ] 

Deng Ziming commented on SPARK-41349:
-

Thank you [~amaliujia], glad to give it a try.

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345






[jira] [Resolved] (SPARK-41336) BroadcastExchange does not support the execute() code path when AQE is enabled

2022-12-01 Thread JacobZheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JacobZheng resolved SPARK-41336.

Fix Version/s: 3.2.2
   Resolution: Fixed

> BroadcastExchange does not support the execute() code path when AQE is enabled
> 
>
> Key: SPARK-41336
> URL: https://issues.apache.org/jira/browse/SPARK-41336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: JacobZheng
>Priority: Major
> Fix For: 3.2.2
>
>
> I am getting an error when running the following code.
> {code:java}
> val df1 = 
> spark.read.format("delta").load("/table/n4bee1a51083e49e6adacf2a").selectExpr("ID","TITLE")
> val df2 = 
> spark.read.format("delta").load("/table/db8e1ef7f0fdb447d8aae2e7").selectExpr("ID","STATUS").filter("STATUS
>  == 3")
> val df3 = 
> spark.read.format("delta").load("/table/q56719945d2534c9c88eb669").selectExpr("EMPNO","TITLE","LEAVEID","WFINSTANCEID","SUBMIT1").filter("SUBMIT1
>  == 1")
> val df4 = 
> spark.read.format("delta").load("/table/pd39b547fb6c24382861af92").selectExpr("`年月`")
> val jr1 = 
> df3.join(df2,df3("WFINSTANCEID")===df2("ID"),"inner").select(df3("EMPNO").as("NEWEMPNO"),df3("TITLE").as("NEWTITLE"),df3("LEAVEID"))
> val jr2 = 
> jr1.join(df1,jr1("LEAVEID")===df1("ID"),"LEFT_OUTER").select(jr1("NEWEMPNO"),jr1("NEWTITLE"),df1("TITLE").as("TYPE"))
> val gr1 = 
> jr2.groupBy(jr2("NEWEMPNO").as("EMPNO__0"),jr2("TYPE").as("TYPE__1")).agg(Map.empty[String,String]).toDF("EMPNO","TYPE")
> val temp1 = gr1.selectExpr("*","9 as KEY")
> val temp2 = df4.selectExpr("*","9 as KEY")
> val jr3 = 
> temp1.join(temp2,temp1("KEY")===temp2("KEY"),"OUTER").select(temp1("EMPNO"),temp1("TYPE"),temp1("KEY"),temp2("`年月`"))
> jr3.show(200)
> {code}
> The error message is as follows
> {code:java}
> java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.executeCodePathUnsupportedError(QueryExecutionErrors.scala:1655)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:203)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:119)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:526)
>   at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:454)
>   at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:453)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:497)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:750)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:325)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
>   at 
> 

[jira] [Assigned] (SPARK-41355) Workaround hive table name validation issue

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41355:


Assignee: (was: Apache Spark)

> Workaround hive table name validation issue
> ---
>
> Key: SPARK-41355
> URL: https://issues.apache.org/jira/browse/SPARK-41355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> For example:
>  * We want to create a table called {{tAb_I}}
>  * Hive metastore will check if the table name is valid by 
> {{MetaStoreUtils.validateName(tbl.getTableName())}}
>  * Hive will call {{HiveStringUtils.normalizeIdentifier(tbl.getTableName())}} 
> and then save the table name in lower case, *but after setting the locale to 
> "tr", it becomes {{tab_ı}}, which is not a valid table name*
>  * When we run an alter table command, we will first get the Hive table, 
> whose name is no longer valid, from the Hive metastore.
>  * We update some properties and then try to save it back to the Hive 
> metastore.
>  * The Hive metastore will check if the table name is valid and then throw 
> the exception {{org.apache.hadoop.hive.ql.metadata.HiveException: [tab_ı]: 
> is not a valid table name}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41355) Workaround hive table name validation issue

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41355:


Assignee: Apache Spark

> Workaround hive table name validation issue
> ---
>
> Key: SPARK-41355
> URL: https://issues.apache.org/jira/browse/SPARK-41355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Minor
>
> For example:
>  * We want to create a table called {{tAb_I}}
>  * Hive metastore will check if the table name is valid by 
> {{MetaStoreUtils.validateName(tbl.getTableName())}}
>  * Hive will call {{HiveStringUtils.normalizeIdentifier(tbl.getTableName())}} 
> and then save the table name in lower case, *but after setting the locale to 
> "tr", it becomes {{tab_ı}}, which is not a valid table name*
>  * When we run an alter table command, we will first get the Hive table, 
> whose name is no longer valid, from the Hive metastore.
>  * We update some properties and then try to save it back to the Hive 
> metastore.
>  * The Hive metastore will check if the table name is valid and then throw 
> the exception {{org.apache.hadoop.hive.ql.metadata.HiveException: [tab_ı]: 
> is not a valid table name}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41355) Workaround hive table name validation issue

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1764#comment-1764
 ] 

Apache Spark commented on SPARK-41355:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38765

> Workaround hive table name validation issue
> ---
>
> Key: SPARK-41355
> URL: https://issues.apache.org/jira/browse/SPARK-41355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Minor
>
> For example:
>  * We want to create a table called {{tAb_I}}
>  * Hive metastore will check if the table name is valid by 
> {{MetaStoreUtils.validateName(tbl.getTableName())}}
>  * Hive will call {{HiveStringUtils.normalizeIdentifier(tbl.getTableName())}} 
> and then save the table name in lower case, *but after setting the locale to 
> "tr", it becomes {{tab_ı}}, which is not a valid table name*
>  * When we run an alter table command, we will first get the Hive table, 
> whose name is no longer valid, from the Hive metastore.
>  * We update some properties and then try to save it back to the Hive 
> metastore.
>  * The Hive metastore will check if the table name is valid and then throw 
> the exception {{org.apache.hadoop.hive.ql.metadata.HiveException: [tab_ı]: 
> is not a valid table name}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41355) Workaround hive table name validation issue

2022-12-01 Thread Wan Kun (Jira)
Wan Kun created SPARK-41355:
---

 Summary: Workaround hive table name validation issue
 Key: SPARK-41355
 URL: https://issues.apache.org/jira/browse/SPARK-41355
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wan Kun


For example:
 * We want to create a table called {{tAb_I}}
 * Hive metastore will check if the table name is valid by 
{{MetaStoreUtils.validateName(tbl.getTableName())}}
 * Hive will call {{HiveStringUtils.normalizeIdentifier(tbl.getTableName())}} 
and then save the table name in lower case, *but after setting the locale to 
"tr", it becomes {{tab_ı}}, which is not a valid table name*
 * When we run an alter table command, we will first get the Hive table, whose 
name is no longer valid, from the Hive metastore.
 * We update some properties and then try to save it back to the Hive 
metastore.
 * The Hive metastore will check if the table name is valid and then throw the 
exception {{org.apache.hadoop.hive.ql.metadata.HiveException: [tab_ı]: is not 
a valid table name}}
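
The underlying JVM behavior can be reproduced in isolation. A minimal sketch 
(standalone, not the actual Hive code path) showing why the default locale 
matters:
{code:scala}
import java.util.Locale

val name = "tAb_I"
// With the ROOT locale, "I" lowercases to "i" and the identifier stays valid.
println(name.toLowerCase(Locale.ROOT))      // tab_i
// With the Turkish locale, "I" lowercases to dotless "ı", an invalid name.
println(name.toLowerCase(new Locale("tr"))) // tab_ı
{code}
The usual guard is to pin identifier casing to a fixed locale such as 
{{Locale.ROOT}} instead of relying on the JVM default.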



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41319) when-otherwise support

2022-12-01 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642221#comment-17642221
 ] 

Ruifeng Zheng commented on SPARK-41319:
---

I thought we needed a dedicated proto message for `when(...).otherwise(...)`, 
but I just found that we have `CaseWhen` in FunctionRegistry, so maybe we can 
still use `UnresolvedFunction` to express this.

I will update the status when I have more findings.
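
For reference, this is the user-facing shape being targeted, shown with the 
existing Scala API (the DataFrame {{df}} and its columns are illustrative):
{code:scala}
import org.apache.spark.sql.functions.{col, lit, when}

// when/otherwise builds a single CaseWhen expression under the hood, which
// is why routing it through the function registry may be sufficient.
val labeled = df.select(
  when(col("age") > lit(3), lit(1)).otherwise(lit(0)).alias("label"))
{code}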

> when-otherwise support
> --
>
> Key: SPARK-41319
> URL: https://issues.apache.org/jira/browse/SPARK-41319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> to support the `when(condition, col).otherwise(col)` function in Spark 
> Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41319) when-otherwise support

2022-12-01 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41319:
--
Description: 
to support the `when(condition, col).otherwise(col)` function in Spark Connect.


> when-otherwise support
> --
>
> Key: SPARK-41319
> URL: https://issues.apache.org/jira/browse/SPARK-41319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> to support the `when(condition, col).otherwise(col)` function in Spark 
> Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41354) Implement `DataFrame.repartitionByRange`

2022-12-01 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41354:
-
Summary: Implement `DataFrame.repartitionByRange`  (was: Support 
`DataFrame.repartitionByRange`)

> Implement `DataFrame.repartitionByRange`
> 
>
> Key: SPARK-41354
> URL: https://issues.apache.org/jira/browse/SPARK-41354
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41354) Support `DataFrame.repartitionByRange`

2022-12-01 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-41354:


 Summary: Support `DataFrame.repartitionByRange`
 Key: SPARK-41354
 URL: https://issues.apache.org/jira/browse/SPARK-41354
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41234) High-order function: array_insert

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41234:


Assignee: (was: Apache Spark)

> High-order function: array_insert
> -
>
> Key: SPARK-41234
> URL: https://issues.apache.org/jira/browse/SPARK-41234
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_insert.html
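> 
> A sketch of the intended semantics, based on the Snowflake reference above. 
> Snowflake positions are 0-based while Spark SQL collection functions are 
> generally 1-based; the sketch assumes 1-based indexing, which is still to be 
> settled in the implementation:
> {code:scala}
> // Hypothetical SQL usage once the function is available:
> spark.sql("SELECT array_insert(array(1, 2, 4), 3, 3)").show()
> // Expected: [1, 2, 3, 4] -- the value 3 inserted at (1-based) position 3,
> // shifting later elements to the right.
> {code}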



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41234) High-order function: array_insert

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41234:


Assignee: Apache Spark

> High-order function: array_insert
> -
>
> Key: SPARK-41234
> URL: https://issues.apache.org/jira/browse/SPARK-41234
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_insert.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41234) High-order function: array_insert

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642207#comment-17642207
 ] 

Apache Spark commented on SPARK-41234:
--

User 'Daniel-Davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/38867

> High-order function: array_insert
> -
>
> Key: SPARK-41234
> URL: https://issues.apache.org/jira/browse/SPARK-41234
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_insert.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41349) Implement `DataFrame.hint`

2022-12-01 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642185#comment-17642185
 ] 

Rui Wang commented on SPARK-41349:
--

cc [~dengziming] if you are interested

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41353) UNRESOLVED_ROUTINE error class

2022-12-01 Thread Serge Rielau (Jira)
Serge Rielau created SPARK-41353:


 Summary: UNRESOLVED_ROUTINE error class
 Key: SPARK-41353
 URL: https://issues.apache.org/jira/browse/SPARK-41353
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Serge Rielau


We want to unify and name:
"_LEGACY_ERROR_TEMP_1041" : {
  "message" : [
"Undefined function ."  ]
},
"_LEGACY_ERROR_TEMP_1242" : {
  "message" : [
"Undefined function: . This function is neither a 
built-in/temporary function, nor a persistent function that is qualified as 
."  ]
},
"_LEGACY_ERROR_TEMP_1243" : {
  "message" : [
"Undefined function: "  ]
}
My proposal is:
UNRESOLVED_ROUTINE. routineName => `a`.`b`.`func`, routineSignature => [INT, 
STRING], searchPath => [`builtin`, `session`, `hiveMetaStore`.`default`]
This assumes agreement to introduce `builtin` as an optional qualifier for 
builtin functions, and `session` as an optional qualifier for temporary 
functions (separate PR).

Q: Why ROUTINE?
A: Some day we may want to support PROCEDURES, and they will follow the same 
naming rules and share the same namespace.

Q: Why a path?
A: We already follow a hard-coded path today with a fixed precedence rule.

Q: Why provide the signature?
A: Long term, we may support overloading of functions by arity, type, or even 
parameter name.
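
A sketch of what the unified entry could look like in error-classes.json; the 
message text, placeholder names, and SQLSTATE below are illustrative, pending 
the actual change:
{code:json}
"UNRESOLVED_ROUTINE" : {
  "message" : [
    "Cannot resolve function <routineName> on search path <searchPath>."
  ],
  "sqlState" : "42883"
}
{code}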



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40970) Support List[Column] for Join's on argument.

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40970:


Assignee: Apache Spark  (was: Rui Wang)

> Support List[Column] for Join's on argument.
> 
>
> Key: SPARK-40970
> URL: https://issues.apache.org/jira/browse/SPARK-40970
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>
> Right now, Join's on does not support a list of ColumnRef: [df.age == 
> df2.age, df.name == df2.name]. We can improve the expression system to 
> figure out a way to support it.
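> 
> For context, the Scala API already gets the same effect by folding a list of 
> conditions into a single Column, which is roughly what the Python client 
> needs to do under the hood; a sketch with illustrative DataFrames df and df2:
> {code:scala}
> // Equivalent of the Python on=[df.age == df2.age, df.name == df2.name]:
> val conditions = Seq(df("age") === df2("age"), df("name") === df2("name"))
> val joined = df.join(df2, conditions.reduce(_ && _), "inner")
> {code}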



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40970) Support List[Column] for Join's on argument.

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40970:


Assignee: Rui Wang  (was: Apache Spark)

> Support List[Column] for Join's on argument.
> 
>
> Key: SPARK-40970
> URL: https://issues.apache.org/jira/browse/SPARK-40970
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>
> Right now, Join's on does not support a list of ColumnRef: [df.age == 
> df2.age, df.name == df2.name]. We can improve the expression system to 
> figure out a way to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40970) Support List[Column] for Join's on argument.

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642166#comment-17642166
 ] 

Apache Spark commented on SPARK-40970:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38866

> Support List[Column] for Join's on argument.
> 
>
> Key: SPARK-40970
> URL: https://issues.apache.org/jira/browse/SPARK-40970
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>
> Right now, Join's on does not support a list of ColumnRef: [df.age == 
> df2.age, df.name == df2.name]. We can improve the expression system to 
> figure out a way to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40970) Support List[Column] for Join's on argument.

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40970:


Assignee: Apache Spark  (was: Rui Wang)

> Support List[Column] for Join's on argument.
> 
>
> Key: SPARK-40970
> URL: https://issues.apache.org/jira/browse/SPARK-40970
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>
> Right now, Join's on does not support a list of ColumnRef: [df.age == 
> df2.age, df.name == df2.name]. We can improve the expression system to 
> figure out a way to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40970) Support List[Column] for Join's on argument.

2022-12-01 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-40970:
-
Summary: Support List[Column] for Join's on argument.  (was: Support 
List[ColumnRef] for Join's on argument.)

> Support List[Column] for Join's on argument.
> 
>
> Key: SPARK-40970
> URL: https://issues.apache.org/jira/browse/SPARK-40970
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>
> Right now, Join's on does not support a list of ColumnRef: [df.age == 
> df2.age, df.name == df2.name]. We can improve the expression system to 
> figure out a way to support it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41232) High-order function: array_append

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41232:


Assignee: (was: Apache Spark)

> High-order function: array_append
> -
>
> Key: SPARK-41232
> URL: https://issues.apache.org/jira/browse/SPARK-41232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html
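> 
> A sketch of the intended semantics, following the Snowflake reference above 
> (hypothetical usage until the implementation lands):
> {code:scala}
> // Hypothetical SQL usage once the function is available:
> spark.sql("SELECT array_append(array(1, 2, 3), 4)").show()
> // Expected: [1, 2, 3, 4] -- the element appended at the end of the array.
> {code}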



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41232) High-order function: array_append

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41232:


Assignee: Apache Spark

> High-order function: array_append
> -
>
> Key: SPARK-41232
> URL: https://issues.apache.org/jira/browse/SPARK-41232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41232) High-order function: array_append

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642150#comment-17642150
 ] 

Apache Spark commented on SPARK-41232:
--

User 'infoankitp' has created a pull request for this issue:
https://github.com/apache/spark/pull/38865

> High-order function: array_append
> -
>
> Key: SPARK-41232
> URL: https://issues.apache.org/jira/browse/SPARK-41232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36950) Normalize semi-structured data into tabular tables.

2022-12-01 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-36950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-36950:

Description: 
Many users get semi-nested data from JSON or XML. 
There are some problems with querying this data where there are nested fields.
In pandas there is a 
[json_normalize|https://github.com/pandas-dev/pandas/blob/v1.3.3/pandas/io/json/_normalize.py#L112-L353]
 function that flattens nested dicts. 

Here are some examples of its use: [Flatten Complex Nested JSON 
(PYSPARK)|https://stackoverflow.com/questions/73599398/flatten-complex-nested-json-pyspark/73666330#73666330],
[Unable to load jsonl nested file into a flattened 
dataframe|https://stackoverflow.com/questions/73546452/unable-to-load-jsonl-nested-file-into-a-flattened-dataframe/73594355#73594355]

With pandas, users can use this function:

 
{code:python}
import json
import pandas as pd

def flatten_pandas(df_):
    # The same as flatten, but for pandas
    have_list = df_.columns[df_.applymap(lambda x: isinstance(x, 
list)).any()].tolist()
    have_dict = df_.columns[df_.applymap(lambda x: isinstance(x, 
dict)).any()].tolist()
    have_nested = len(have_list) + len(have_dict)

    while have_nested != 0:
        if len(have_list) != 0:
            # Explode every list column into one row per element.
            for col in have_list:
                df_ = df_.explode(col)

        elif len(have_dict) != 0:
            # Flatten dict columns via a JSON round-trip, ":"-separated.
            df_ = pd.json_normalize(json.loads(df_.to_json(force_ascii=False, 
orient="records")), sep=":")

        have_list = df_.columns[df_.applymap(lambda x: isinstance(x, 
list)).any()].tolist()
        have_dict = df_.columns[df_.applymap(lambda x: isinstance(x, 
dict)).any()].tolist()
        have_nested = len(have_list) + len(have_dict)

    return df_
{code}
 

With pyspark or pandas_api, we don't have a function implemented for turning 
dicts into columns. 
These are the functions I'm using to do the same in pyspark.
{code:python}
from pyspark.sql.functions import *
from pyspark.sql.types import *


def flatten_test(df, sep="_"):
    """Returns a flattened dataframe.
    .. versionadded:: x.X.X

    Parameters
    ----------
    sep : str
        Delimiter for flatted columns. Default `_`

    Notes
    -----
    Don't use `.` as `sep`.
    It won't work on nested data frames with more than one level.
    And you will have to use `columns.name`.

    Flattening Map Types will have to find every key in the column.
    This can be slow.

    Examples
    

    data_mixed = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": "Rick Scott"},
            "counties": [
                {"name": "Dade", "population": 12345},
                {"name": "Broward", "population": 4},
                {"name": "Palm Beach", "population": 6},
            ],
        },
        {
            "state": "Ohio",
            "shortname": "OH",
            "info": {"governor": "John Kasich"},
            "counties": [
                {"name": "Summit", "population": 1234},
                {"name": "Cuyahoga", "population": 1337},
            ],
        },
    ]

    data_mixed = spark.createDataFrame(data=data_mixed)

    data_mixed.printSchema()

    root
    |-- counties: array (nullable = true)
    |    |-- element: map (containsNull = true)
    |    |    |-- key: string
    |    |    |-- value: string (valueContainsNull = true)
    |-- info: map (nullable = true)
    |    |-- key: string
    |    |-- value: string (valueContainsNull = true)
    |-- shortname: string (nullable = true)
    |-- state: string (nullable = true)


    data_mixed_flat = flatten_test(df, sep=":")
    data_mixed_flat.printSchema()
    root
    |-- shortname: string (nullable = true)
    |-- state: string (nullable = true)
    |-- counties:name: string (nullable = true)
    |-- counties:population: string (nullable = true)
    |-- info:governor: string (nullable = true)




    data = [
        {
            "id": 1,
            "name": "Cole Volk",
            "fitness": {"height": 130, "weight": 60},
        },
        {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
        {
            "id": 2,
            "name": "Faye Raker",
            "fitness": {"height": 130, "weight": 60},
        },
    ]


    df = spark.createDataFrame(data=data)

    df.printSchema()

    root
    |-- fitness: map (nullable = true)
    |    |-- key: string
    |    |-- value: long (valueContainsNull = true)
    |-- id: long (nullable = true)
    |-- name: string (nullable = true)

    df_flat = flatten_test(df, sep=":")

    df_flat.printSchema()

    root
    |-- id: long (nullable = true)
    |-- name: string (nullable = true)
    |-- fitness:height: long (nullable = true)
    |-- fitness:weight: long (nullable = true)

    data_struct = [
            (("James",None,"Smith"),"OH","M"),
            (("Anna","Rose",""),"NY","F"),
            

[jira] [Commented] (SPARK-41271) Parameterized SQL

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642130#comment-17642130
 ] 

Apache Spark commented on SPARK-41271:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38864

> Parameterized SQL
> -
>
> Key: SPARK-41271
> URL: https://issues.apache.org/jira/browse/SPARK-41271
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Enhance the Spark SQL API with support for parameterized SQL statements to 
> improve security and reusability. Application developers will be able to 
> write SQL with parameter markers whose values will be passed separately from 
> the SQL code and interpreted as literals. This will help prevent SQL 
> injection attacks for applications that generate SQL based on a user’s 
> selections, which is often done via a user interface.
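> 
> A sketch of what the user-facing call could look like; the marker syntax 
> (:name) and the args parameter shown here are assumptions, not the final API:
> {code:scala}
> // Parameter values travel separately from the SQL text and are bound as
> // literals, so user input cannot alter the query structure.
> val df = spark.sql(
>   "SELECT * FROM orders WHERE status = :status AND amount > :minAmount",
>   args = Map("status" -> "'shipped'", "minAmount" -> "100"))
> {code}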



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41271) Parameterized SQL

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642129#comment-17642129
 ] 

Apache Spark commented on SPARK-41271:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38864

> Parameterized SQL
> -
>
> Key: SPARK-41271
> URL: https://issues.apache.org/jira/browse/SPARK-41271
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Enhance the Spark SQL API with support for parameterized SQL statements to 
> improve security and reusability. Application developers will be able to 
> write SQL with parameter markers whose values will be passed separately from 
> the SQL code and interpreted as literals. This will help prevent SQL 
> injection attacks for applications that generate SQL based on a user’s 
> selections, which is often done via a user interface.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40586) Decouple plan transformation and validation on server side

2022-12-01 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642120#comment-17642120
 ] 

Rui Wang commented on SPARK-40586:
--

Contributions are welcome!

> Decouple plan transformation and validation on server side 
> ---
>
> Key: SPARK-40586
> URL: https://issues.apache.org/jira/browse/SPARK-40586
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>
> Project Connect, from some perspectives, can be thought of as replacing the 
> SQL parser: it generates a parsed (but, unlike the parser's output in one 
> respect, unresolved) plan, which is then passed to the analyzer. This means 
> that Connect should also validate the proto, as there are many invalid-parse 
> cases that the analyzer does not expect to see, which could cause problems 
> if Connect only passed the proto through (translated, of course) to the 
> analyzer.
> Meanwhile, I think it is a good idea to decouple validation and 
> transformation so that we have two stages:
> stage 1: proto validation, e.g. validating whether the necessary fields are 
> populated.
> stage 2: transformation, which converts the proto to a plan under the 
> assumption that the proto is a valid parsed version of the plan.
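> 
> A minimal sketch of the proposed two-stage shape; all names are illustrative, 
> not the actual Connect planner code:
> {code:scala}
> // Stage 1: proto validation -- fail fast on malformed input.
> def validate(rel: proto.Relation): Unit =
>   require(rel.getRelTypeCase != proto.Relation.RelTypeCase.RELTYPE_NOT_SET,
>     "Relation must set its rel_type")
> 
> // Stage 2: transformation -- may assume validate() has already passed.
> def transform(rel: proto.Relation): LogicalPlan = ???
> 
> def handle(rel: proto.Relation): LogicalPlan = {
>   validate(rel)
>   transform(rel)
> }
> {code}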



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40586) Decouple plan transformation and validation on server side

2022-12-01 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang reassigned SPARK-40586:


Assignee: (was: Rui Wang)

> Decouple plan transformation and validation on server side 
> ---
>
> Key: SPARK-40586
> URL: https://issues.apache.org/jira/browse/SPARK-40586
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>
> Project Connect, from some perspectives, can be thought of as replacing the 
> SQL parser: it generates a parsed (but, unlike the parser's output in one 
> respect, unresolved) plan, which is then passed to the analyzer. This means 
> that Connect should also validate the proto, as there are many invalid-parse 
> cases that the analyzer does not expect to see, which could cause problems 
> if Connect only passed the proto through (translated, of course) to the 
> analyzer.
> Meanwhile, I think it is a good idea to decouple validation and 
> transformation so that we have two stages:
> stage 1: proto validation, e.g. validating whether the necessary fields are 
> populated.
> stage 2: transformation, which converts the proto to a plan under the 
> assumption that the proto is a valid parsed version of the plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-41352) Support DataFrame.hint

2022-12-01 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang deleted SPARK-41352:
-


> Support DataFrame.hint
> --
>
> Key: SPARK-41352
> URL: https://issues.apache.org/jira/browse/SPARK-41352
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Rui Wang
>Priority: Major
>
> We have hint in proto now: 
> https://github.com/apache/spark/commit/0f1c515179e5ed34aca27c51f500c26ca19cc748.
> The remaining work is adding the support in the client and server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41352) Support DataFrame.hint

2022-12-01 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41352:
-
Description: 
We have hint in proto now: 
https://github.com/apache/spark/commit/0f1c515179e5ed34aca27c51f500c26ca19cc748.

The remaining work is adding the support in the client and server.

> Support DataFrame.hint
> --
>
> Key: SPARK-41352
> URL: https://issues.apache.org/jira/browse/SPARK-41352
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>
> We have hint in proto now: 
> https://github.com/apache/spark/commit/0f1c515179e5ed34aca27c51f500c26ca19cc748.
> The remaining work is adding the support in the client and server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41352) Support DataFrame.hint

2022-12-01 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642118#comment-17642118
 ] 

Rui Wang commented on SPARK-41352:
--

This JIRA is open for anyone to pick up. 

> Support DataFrame.hint
> --
>
> Key: SPARK-41352
> URL: https://issues.apache.org/jira/browse/SPARK-41352
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41352) Support DataFrame.hint

2022-12-01 Thread Rui Wang (Jira)
Rui Wang created SPARK-41352:


 Summary: Support DataFrame.hint
 Key: SPARK-41352
 URL: https://issues.apache.org/jira/browse/SPARK-41352
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41313) Combine fixes for SPARK-3900 and SPARK-21138

2022-12-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-41313:
-
Priority: Minor  (was: Major)

> Combine fixes for SPARK-3900 and SPARK-21138
> 
>
> Key: SPARK-41313
> URL: https://issues.apache.org/jira/browse/SPARK-41313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Priority: Minor
>
> SPARK-3900 fixed the IllegalStateException in cleanupStagingDir in 
> ApplicationMaster's shutdown hook. However, SPARK-21138 accidentally 
> reverted/undid that change when fixing the "Wrong FS" bug. Now, we are seeing 
> SPARK-3900 reported by our users at LinkedIn. We need to bring back the fix 
> for SPARK-3900.
> The IllegalStateException when creating a new filesystem object is due to a 
> limitation in Hadoop: we cannot register a shutdown hook during shutdown. 
> So, when a Spark job fails during pre-launch, cleanupStagingDir is called as 
> part of shutdown. Then, if we attempt to create a new filesystem object for 
> the first time, Hadoop tries to register a hook to shut down KeyProviderCache 
> when creating a ClientContext for DFSClient. As a result, we hit the 
> IllegalStateException. We should avoid creating a new filesystem object in 
> cleanupStagingDir() when it is called from a shutdown hook. That fix was 
> introduced in SPARK-3900, but SPARK-21138 accidentally reverted/undid it. We 
> need to bring that fix back to Spark to avoid the IllegalStateException.
>   
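> 
> A sketch of the avoidance pattern being described: obtain the FileSystem 
> while shutdown hooks can still be registered, and never construct one inside 
> the hook itself (names are illustrative, not the actual ApplicationMaster 
> code):
> {code:scala}
> import org.apache.hadoop.fs.{FileSystem, Path}
> 
> // Cached during normal startup, while registering hooks is still legal.
> @volatile private var stagingDirFs: Option[FileSystem] = None
> 
> private def cleanupStagingDir(stagingDir: Path): Unit = stagingDirFs match {
>   case Some(fs) => fs.delete(stagingDir, true)
>   case None =>
>     // Creating a FileSystem here could make Hadoop register a shutdown hook
>     // (for KeyProviderCache) during shutdown, throwing the
>     // IllegalStateException this ticket describes, so do nothing.
>     ()
> }
> {code}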



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41313) Combine fixes for SPARK-3900 and SPARK-21138

2022-12-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-41313:
-
Target Version/s:   (was: 3.2.4, 3.3.2, 3.4.0)

> Combine fixes for SPARK-3900 and SPARK-21138
> 
>
> Key: SPARK-41313
> URL: https://issues.apache.org/jira/browse/SPARK-41313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Priority: Major
>
> SPARK-3900 fixed the IllegalStateException in cleanupStagingDir in 
> ApplicationMaster's shutdown hook. However, SPARK-21138 accidentally 
> reverted/undid that change when fixing the "Wrong FS" bug. Now, we are seeing 
> SPARK-3900 reported by our users at LinkedIn. We need to bring back the fix 
> for SPARK-3900.
> The IllegalStateException when creating a new filesystem object is due to a 
> limitation in Hadoop: we cannot register a shutdown hook during shutdown. 
> So, when a Spark job fails during pre-launch, cleanupStagingDir is called as 
> part of shutdown. Then, if we attempt to create a new filesystem object for 
> the first time, Hadoop tries to register a hook to shut down KeyProviderCache 
> when creating a ClientContext for DFSClient. As a result, we hit the 
> IllegalStateException. We should avoid creating a new filesystem object in 
> cleanupStagingDir() when it is called from a shutdown hook. That fix was 
> introduced in SPARK-3900, but SPARK-21138 accidentally reverted/undid it. We 
> need to bring that fix back to Spark to avoid the IllegalStateException.
>   



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-12-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-41008:
-
Priority: Minor  (was: Major)

> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Minor
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql import functions as F
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 
> IsotonicRegression_pyspark
> # The P(positives | model_score):
> # 0.6 -> 0.5 (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20 -> 0.25 (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
> "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],   
>       
> "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],         
> "weight": 1,     }
> )
> # The fraction of positives for each of the distinct model_scores would be 
> the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 
> 0.25]
> # The sklearn implementation of Isotonic Regression (imported above).
> tc_regressor_sklearn = 
> IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], 
> sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]
> # The pyspark implementation of Isotonic Regression. 
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', 
> F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = 
> IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', 
> weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]
> # The result from the pyspark implementation seems unclear. Similar small toy 
> examples lead to similar non-expected results for the pyspark implementation. 
> # Strangely enough, for 'large' datasets, the difference between calibrated 
> model_scores generated by both implementations dissapears. 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41013) Submitting a job in cluster mode with spark-3.1.2 fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream

2022-12-01 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642001#comment-17642001
 ] 

Sean R. Owen commented on SPARK-41013:
--

Can you clarify the issue? This doesn't look like it directly relates to Spark, 
but the error message is truncated. We need to see the underlying cause.

> Submitting a job in cluster mode with spark-3.1.2 fails with Could not 
> initialize class com.github.luben.zstd.ZstdOutputStream
> 
>
> Key: SPARK-41013
> URL: https://issues.apache.org/jira/browse/SPARK-41013
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: yutiantian
>Priority: Major
>  Labels: libzstd-jni, spark.shuffle.mapStatus.compression.codec, 
> zstd
>
> When submitting a job in cluster mode with spark-3.1.2, it fails with
> Could not initialize class com.github.luben.zstd.ZstdOutputStream. The 
> detailed log is as follows:
> Exception in thread "map-output-dispatcher-0" Exception in thread 
> "map-output-dispatcher-2" java.lang.ExceptionInInitializerError: Cannot 
> unpack libzstd-jni: No such file or directory at 
> java.io.UnixFileSystem.createFileExclusively(Native Method) at 
> java.io.File.createTempFile(File.java:2024) at 
> com.github.luben.zstd.util.Native.load(Native.java:97) at 
> com.github.luben.zstd.util.Native.load(Native.java:55) at 
> com.github.luben.zstd.ZstdOutputStream.(ZstdOutputStream.java:16) at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
>  at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Exception in thread 
> "map-output-dispatcher-7" Exception in thread "map-output-dispatcher-5" 
> java.lang.NoClassDefFoundError: Could not initialize class 
> com.github.luben.zstd.ZstdOutputStream at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
>  at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Exception in thread 
> "map-output-dispatcher-4" Exception in thread "map-output-dispatcher-3" 
> java.lang.NoClassDefFoundError: Could not initialize class 
> com.github.luben.zstd.ZstdOutputStream at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
>  at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) java.lang.NoClassDefFoundError: 
> Could not initialize class com.github.luben.zstd.ZstdOutputStream at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> 

[jira] [Resolved] (SPARK-41087) Make `build/mvn` use the same JAVA_OPTS as `dev/make-distribution.sh`

2022-12-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-41087.
--
Fix Version/s: 3.4.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/38589

> Make `build/mvn` use the same JAVA_OPTS as `dev/make-distribution.sh`
> -
>
> Key: SPARK-41087
> URL: https://issues.apache.org/jira/browse/SPARK-41087
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-12-01 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641994#comment-17641994
 ] 

Sean R. Owen commented on SPARK-41219:
--

Is this the same as https://issues.apache.org/jira/browse/SPARK-41207 ?

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4 Integral Divide
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41253) Make K8s volcano IT work in Github Action

2022-12-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-41253:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> Make K8s volcano IT work in Github Action
> -
>
> Key: SPARK-41253
> URL: https://issues.apache.org/jira/browse/SPARK-41253
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41324) Follow-up on JDK-8180450

2022-12-01 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641992#comment-17641992
 ] 

Sean R. Owen commented on SPARK-41324:
--

Or isn't this a dupe of https://issues.apache.org/jira/browse/SPARK-41318

> Follow-up on JDK-8180450
> 
>
> Key: SPARK-41324
> URL: https://issues.apache.org/jira/browse/SPARK-41324
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Herman van Hövell
>Priority: Major
>
> Per [https://twitter.com/forked_franz/status/1597468851968831489]
> We should follow up on: [https://bugs.openjdk.org/browse/JDK-8180450]
> There are a few concrete tasks here:
>  # Upgrade to Netty 4.1.84.
>  # (Optional) Write a benchmark that exercises this code path. Anchoring this 
> in the build will be a bit of a challenge though.
>  # Check if there are other places where this bug manifests itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41319) when-otherwise support

2022-12-01 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641993#comment-17641993
 ] 

Sean R. Owen commented on SPARK-41319:
--

Can we get a little more detail in these sub-tasks?

> when-otherwise support
> --
>
> Key: SPARK-41319
> URL: https://issues.apache.org/jira/browse/SPARK-41319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41324) Follow-up on JDK-8180450

2022-12-01 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641990#comment-17641990
 ] 

Sean R. Owen commented on SPARK-41324:
--

Can we get a little more detail in here / update the title? What's the issue?

> Follow-up on JDK-8180450
> 
>
> Key: SPARK-41324
> URL: https://issues.apache.org/jira/browse/SPARK-41324
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Herman van Hövell
>Priority: Major
>
> Per [https://twitter.com/forked_franz/status/1597468851968831489]
> We should follow up on: [https://bugs.openjdk.org/browse/JDK-8180450]
> There are a few concrete tasks here:
>  # Upgrade to Netty 4.1.84.
>  # (Optional) Write a benchmark that exercises this code path. Anchoring this 
> in the build will be a bit of a challenge though.
>  # Check if there are other places where this bug manifests itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41342) Add support for distributed deep learning framework

2022-12-01 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641989#comment-17641989
 ] 

Sean R. Owen commented on SPARK-41342:
--

Why not Horovod? It works with Spark and PyTorch.

> Add support for distributed deep learning framework
> ---
>
> Key: SPARK-41342
> URL: https://issues.apache.org/jira/browse/SPARK-41342
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.2
>Reporter: Lu Wang
>Priority: Major
>
> There is a clear trend for deep learning to move from single-machine to 
> distributed training in order to scale and accelerate it. Adding support for 
> a distributed DL solution on Spark will make Spark more powerful and largely 
> simplify distributed DL workloads for users. 
> Currently, 
> [spark-tensorflow-distributor|https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor]
>  provides a solution to run distributed TensorFlow on Spark clusters, but 
> there is no such support for distributed PyTorch. 
> We want to add a general framework that supports both DL frameworks, giving 
> us a unified interface for distributed DL workloads on Spark. It can also 
> take advantage of Spark's GPU scheduling and provide better resource 
> management. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41351) Column does not support !=

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641940#comment-17641940
 ] 

Apache Spark commented on SPARK-41351:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/38863

> Column does not support !=
> --
>
> Key: SPARK-41351
> URL: https://issues.apache.org/jira/browse/SPARK-41351
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>
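
For reference: the classic Scala API spells Column inequality as =!= (plain != is reserved by the language), and the Python client overloads __ne__. A minimal sketch of the semantics the Connect client's != needs to match:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NotEqualSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("id")
    // Column inequality is Not(EqualTo(...)) underneath, with the usual
    // SQL three-valued semantics around NULLs.
    df.filter(col("id") =!= 2).show() // keeps id = 1 and id = 3
    spark.stop()
  }
}
{code}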




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41351) Column does not support !=

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641939#comment-17641939
 ] 

Apache Spark commented on SPARK-41351:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/38863

> Column does not support !=
> --
>
> Key: SPARK-41351
> URL: https://issues.apache.org/jira/browse/SPARK-41351
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41351) Column does not support !=

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41351:


Assignee: (was: Apache Spark)

> Column does not support !=
> --
>
> Key: SPARK-41351
> URL: https://issues.apache.org/jira/browse/SPARK-41351
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41351) Column does not support !=

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41351:


Assignee: Apache Spark

> Column does not support !=
> --
>
> Key: SPARK-41351
> URL: https://issues.apache.org/jira/browse/SPARK-41351
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41351) Column does not support !=

2022-12-01 Thread Martin Grund (Jira)
Martin Grund created SPARK-41351:


 Summary: Column does not support !=
 Key: SPARK-41351
 URL: https://issues.apache.org/jira/browse/SPARK-41351
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Martin Grund






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41350) allow simple name access of using join hidden columns after subquery alias

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41350:


Assignee: Apache Spark

> allow simple name access of using join hidden columns after subquery alias
> --
>
> Key: SPARK-41350
> URL: https://issues.apache.org/jira/browse/SPARK-41350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41350) allow simple name access of using join hidden columns after subquery alias

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641930#comment-17641930
 ] 

Apache Spark commented on SPARK-41350:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/38862

> allow simple name access of using join hidden columns after subquery alias
> --
>
> Key: SPARK-41350
> URL: https://issues.apache.org/jira/browse/SPARK-41350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Wenchen Fan
>Priority: Major
>
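
For readers following along, a rough sketch of the query shape involved (hypothetical tables; the exact failing case is in the PR). A JOIN ... USING (k) de-duplicates the join key in its output but keeps the qualified originals (t1.k, t2.k) as hidden columns; this change concerns resolving such columns by simple name once the join sits behind a subquery alias:

{code:java}
import org.apache.spark.sql.SparkSession

object UsingJoinHiddenColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    Seq((1, "a")).toDF("k", "v1").createOrReplaceTempView("t1")
    Seq((1, "b")).toDF("k", "v2").createOrReplaceTempView("t2")

    // The USING join exposes a single `k`; t1.k / t2.k live on as hidden
    // columns. The fix is about name resolution for such columns after
    // the join is wrapped in a subquery alias like `s` below.
    spark.sql(
      """SELECT s.v1, s.v2, s.k
        |FROM (SELECT * FROM t1 JOIN t2 USING (k)) s
        |WHERE s.k = 1""".stripMargin).show()
    spark.stop()
  }
}
{code}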




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41350) allow simple name access of using join hidden columns after subquery alias

2022-12-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41350:


Assignee: (was: Apache Spark)

> allow simple name access of using join hidden columns after subquery alias
> --
>
> Key: SPARK-41350
> URL: https://issues.apache.org/jira/browse/SPARK-41350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41350) allow simple name access of using join hidden columns after subquery alias

2022-12-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641929#comment-17641929
 ] 

Apache Spark commented on SPARK-41350:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/38862

> allow simple name access of using join hidden columns after subquery alias
> --
>
> Key: SPARK-41350
> URL: https://issues.apache.org/jira/browse/SPARK-41350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41350) allow simple name access of using join hidden columns after subquery alias

2022-12-01 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-41350:
---

 Summary: allow simple name access of using join hidden columns 
after subquery alias
 Key: SPARK-41350
 URL: https://issues.apache.org/jira/browse/SPARK-41350
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41348) Refactor `UnsafeArrayWriterSuite` to check error class

2022-12-01 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41348:


Assignee: Yang Jie

> Refactor `UnsafeArrayWriterSuite` to check error class
> --
>
> Key: SPARK-41348
> URL: https://issues.apache.org/jira/browse/SPARK-41348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
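
For context, these refactoring sub-tasks move suites from raw message assertions to the shared checkError helper in SparkFunSuite. Roughly this shape (illustrative only; the error class, parameters, and the throwing code are placeholders, not the suite's real values):

{code:java}
import org.apache.spark.{SparkException, SparkFunSuite}

class ExampleErrorClassSuite extends SparkFunSuite {
  test("error class assertion pattern") {
    val e = intercept[SparkException] {
      // Stand-in for the code under test (e.g. driving UnsafeArrayWriter
      // with invalid input); placeholder error class and parameters.
      throw new SparkException(
        errorClass = "SOME_ERROR_CLASS",
        messageParameters = Map("param" -> "value"),
        cause = null)
    }
    // Asserts on the structured error class instead of the message text.
    checkError(
      exception = e,
      errorClass = "SOME_ERROR_CLASS",
      parameters = Map("param" -> "value"))
  }
}
{code}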




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41348) Refactor `UnsafeArrayWriterSuite` to check error class

2022-12-01 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41348.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38860
[https://github.com/apache/spark/pull/38860]

> Refactor `UnsafeArrayWriterSuite` to check error class
> --
>
> Key: SPARK-41348
> URL: https://issues.apache.org/jira/browse/SPARK-41348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41336) BroadcastExchange does not support the execute() code path. when AQE enabled

2022-12-01 Thread JacobZheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641843#comment-17641843
 ] 

JacobZheng commented on SPARK-41336:


[SPARK-39551][SQL][3.2] Add AQE invalid plan check solved this case.
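
For anyone stuck on an affected 3.2.x build without that check: since the failure only reproduces with AQE enabled, turning AQE off for the offending job is a plausible workaround (sketch below, reusing the repro's own jr3; verify the performance impact on your workload):

{code:java}
// Workaround sketch for affected versions: run the failing action with
// AQE disabled, then restore the setting.
spark.conf.set("spark.sql.adaptive.enabled", "false")
jr3.show(200)
spark.conf.set("spark.sql.adaptive.enabled", "true")
{code}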

> BroadcastExchange does not support the execute() code path. when AQE enabled
> 
>
> Key: SPARK-41336
> URL: https://issues.apache.org/jira/browse/SPARK-41336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: JacobZheng
>Priority: Major
>
> I am getting an error when running the following code.
> {code:java}
> val df1 = 
> spark.read.format("delta").load("/table/n4bee1a51083e49e6adacf2a").selectExpr("ID","TITLE")
> val df2 = 
> spark.read.format("delta").load("/table/db8e1ef7f0fdb447d8aae2e7").selectExpr("ID","STATUS").filter("STATUS
>  == 3")
> val df3 = 
> spark.read.format("delta").load("/table/q56719945d2534c9c88eb669").selectExpr("EMPNO","TITLE","LEAVEID","WFINSTANCEID","SUBMIT1").filter("SUBMIT1
>  == 1")
> val df4 = 
> spark.read.format("delta").load("/table/pd39b547fb6c24382861af92").selectExpr("`年月`")
> val jr1 = 
> df3.join(df2,df3("WFINSTANCEID")===df2("ID"),"inner").select(df3("EMPNO").as("NEWEMPNO"),df3("TITLE").as("NEWTITLE"),df3("LEAVEID"))
> val jr2 = 
> jr1.join(df1,jr1("LEAVEID")===df1("ID"),"LEFT_OUTER").select(jr1("NEWEMPNO"),jr1("NEWTITLE"),df1("TITLE").as("TYPE"))
> val gr1 = 
> jr2.groupBy(jr2("NEWEMPNO").as("EMPNO__0"),jr2("TYPE").as("TYPE__1")).agg(Map.empty[String,String]).toDF("EMPNO","TYPE")
> val temp1 = gr1.selectExpr("*","9 as KEY")
> val temp2 = df4.selectExpr("*","9 as KEY")
> val jr3 = 
> temp1.join(temp2,temp1("KEY")===temp2("KEY"),"OUTER").select(temp1("EMPNO"),temp1("TYPE"),temp1("KEY"),temp2("`年月`"))
> jr3.show(200)
> {code}
> The error message is as follows
> {code:java}
> java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.executeCodePathUnsupportedError(QueryExecutionErrors.scala:1655)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:203)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:119)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:526)
>   at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:454)
>   at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:453)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:497)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:750)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:325)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338)
>   at 
> 

[jira] [Resolved] (SPARK-41345) Add Hint to Connect Proto

2022-12-01 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41345.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38857
[https://github.com/apache/spark/pull/38857]

> Add Hint to Connect Proto
> -
>
> Key: SPARK-41345
> URL: https://issues.apache.org/jira/browse/SPARK-41345
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41349) Implement `DataFrame.hint`

2022-12-01 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641833#comment-17641833
 ] 

Ruifeng Zheng commented on SPARK-41349:
---

Contributions are welcome!

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345
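
For reference, a minimal local sketch of the classic DataFrame.hint behavior to reproduce:

{code:java}
import org.apache.spark.sql.SparkSession

object HintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val small = Seq((1, "a")).toDF("id", "v")
    val big = Seq((1, "x"), (2, "y")).toDF("id", "w")
    // hint() attaches a named hint (with optional parameters) to the
    // plan; here the optimizer should broadcast the small side.
    big.join(small.hint("broadcast"), "id").explain()
    spark.stop()
  }
}
{code}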



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41349) Implement `DataFrame.hint`

2022-12-01 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41349:
--
Description: implement DataFrame.hint with the proto message added in 
https://issues.apache.org/jira/browse/SPARK-41345

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41349) Implement `DataFrame.hint`

2022-12-01 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41349:
-

 Summary: Implement `DataFrame.hint`
 Key: SPARK-41349
 URL: https://issues.apache.org/jira/browse/SPARK-41349
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41346) Implement asc and desc methods

2022-12-01 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41346.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38858
[https://github.com/apache/spark/pull/38858]

> Implement asc and desc methods
> --
>
> Key: SPARK-41346
> URL: https://issues.apache.org/jira/browse/SPARK-41346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
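
For reference, the behavior implemented here mirrors the classic Column.asc/Column.desc sort-order API; a minimal sketch:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AscDescSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "b"), (2, "a"), (3, "a")).toDF("id", "v")
    // asc/desc wrap a column in a SortOrder expression.
    df.sort(col("v").asc, col("id").desc).show()
    spark.stop()
  }
}
{code}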




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41347) Add Cast to Expression proto

2022-12-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41347.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38859
[https://github.com/apache/spark/pull/38859]

> Add Cast to Expression proto
> 
>
> Key: SPARK-41347
> URL: https://issues.apache.org/jira/browse/SPARK-41347
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Critical
> Fix For: 3.4.0
>
>
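
For reference, a minimal sketch of the classic Column.cast behavior the new proto message needs to carry:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2).toDF("id")
    // cast accepts a DataType or a DDL type string.
    df.select(col("id").cast("string").as("id_str")).printSchema()
    spark.stop()
  }
}
{code}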




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41314) Assign a name to the error class _LEGACY_ERROR_TEMP_1094

2022-12-01 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41314:


Assignee: Yang Jie

> Assign a name to the error class _LEGACY_ERROR_TEMP_1094
> 
>
> Key: SPARK-41314
> URL: https://issues.apache.org/jira/browse/SPARK-41314
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41314) Assign a name to the error class _LEGACY_ERROR_TEMP_1094

2022-12-01 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41314.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38856
[https://github.com/apache/spark/pull/38856]

> Assign a name to the error class _LEGACY_ERROR_TEMP_1094
> 
>
> Key: SPARK-41314
> URL: https://issues.apache.org/jira/browse/SPARK-41314
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39257) use spark.read.jdbc() to read data from SQL database into dataframe, it fails silently, when the session is killed from SQL server side

2022-12-01 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641736#comment-17641736
 ] 

Sandeep Katta edited comment on SPARK-39257 at 12/1/22 8:02 AM:


[~xinrantao]  I believe you are facing the same issue as 
[https://github.com/microsoft/mssql-jdbc/issues/1846]. This is fixed in 
version [12.1.0|https://github.com/microsoft/mssql-jdbc/releases/tag/v12.1.0] 
by the *mssql-jdbc* team in PR 
[1942|https://github.com/microsoft/mssql-jdbc/pull/1942], so you can use the 
mssql-jdbc jar at version 12.1.0 to fix this issue.
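
A sketch of one way to pull in the fixed driver at session build time (the Maven coordinates and the jre8/jre11 classifier here are assumptions -- check Maven Central for the exact 12.1.0+ artifact you need):

{code:java}
import org.apache.spark.sql.SparkSession

object FixedDriverSession {
  def main(args: Array[String]): Unit = {
    // spark.jars.packages resolves the listed coordinates at startup.
    val spark = SparkSession.builder()
      .config("spark.jars.packages",
        "com.microsoft.sqlserver:mssql-jdbc:12.1.0.jre8")
      .getOrCreate()
    println(spark.version)
    spark.stop()
  }
}
{code}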

 


was (Author: sandeep.katta2007):
[~xinrantao]  I believe you are facing the same issue as 
[https://github.com/microsoft/mssql-jdbc/issues/1846]. This is fixed by 
the *mssql-jdbc* team in PR 
[1942|https://github.com/microsoft/mssql-jdbc/pull/1942], which is available in 
release [12.1.0|https://github.com/microsoft/mssql-jdbc/releases/tag/v12.1.0]. 
You can use this upgraded jar to solve this issue.

 

> use spark.read.jdbc() to read data from SQL database into dataframe, it fails 
> silently, when the session is killed from SQL server side
> --
>
> Key: SPARK-39257
> URL: https://issues.apache.org/jira/browse/SPARK-39257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.2, 3.2.1
> Environment: {*}Spark version{*}: spark 3.0.1/3.1.2/3.2.1
> *Microsoft JDBC Driver* *for SQL server:* 
> mssql-jdbc-8.2.1.jre8/mssql-jdbc-10.2.1.jre8.jar
>Reporter: Xinran Tao
>Priority: Major
>
> I'm using *spark.read.jdbc()* to read from a SQL database into a dataframe, 
> which uses the *Microsoft JDBC Driver for SQL Server* to get data from the 
> SQL server.
> *codes:*
>  
> {code:java}
> %scala
> val token = "xxx"
> val jdbcHostname = "xinrandatabseserver.database.windows.net"
> val jdbcDatabase = "xinranSQLDatabase"
> val jdbcPort = 1433
> val jdbcUrl = 
> "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname,
>  jdbcPort, jdbcDatabase)+ ";accessToken="
> import java.util.Properties
> val connectionProperties = new Properties()
> val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
> connectionProperties.setProperty("Driver", driverClass)
> connectionProperties.setProperty("accesstoken", token)
> val sql_pushdown = "(select UNITS from payment_balance_new) emp_alias"
> val df_stripe_dispute = spark.read.option("connectRetryCount", 
> 200).option("numPartitions",1).jdbc(url=jdbcUrl, table=sql_pushdown, 
> properties=connectionProperties)
> df_stripe_dispute.count()
> {code}
>  
>  
> The session was accidentally killed by some automatic scripts on the SQL 
> server side, but no errors showed up on the Spark side and no failure was 
> observed. However, the count() result shows far fewer records than there 
> should be.
>  
> If I'm directly using *Microsoft JDBC Driver* *for SQL server* to run the 
> query and print the data out, which doesn't involve spark, there would be a 
> connection reset error thrown out.
> *codes:*
>  
> {code:java}
> %scala
> import java.sql.DriverManager
> import java.sql.Connection
> import java.util.Properties;
> val jdbcHostname = "xinrandatabseserver.database.windows.net"
> val jdbcDatabase = "xinranSQLDatabase"
> val jdbcPort = "1433"
> val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
> val token = ""
> val jdbcUrl = 
> "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname,
>  jdbcPort, jdbcDatabase)+ ";accessToken="+token
>  
> var connection:Connection = null
> val info:Properties = new Properties();
> info.setProperty("accesstoken", token);
>     
> // make the connection
> Class.forName(driver)
> connection = DriverManager.getConnection(jdbcUrl,info )
> // create the statement, and run the select query
> val statement = connection.createStatement()
> val resultSet = statement.executeQuery("select UNITS from 
> payment_balance_new")
> while ( resultSet.next() ) {
>   println("__"+resultSet.getString(1))
> }
> {code}
>  
> *errors:*
>  
> {code:java}
> com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2998)
>  at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2034) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6446) at 
> com.microsoft.sqlserver.jdbc.TDSReader.nextPacket(IOBuffer.java:6396) at 
> com.microsoft.sqlserver.jdbc.TDSReader.ensurePayload(IOBuffer.java:6374) at 
