[jira] [Assigned] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job

2023-10-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45407:
-

Assignee: Dongjoon Hyun

> Skip Unidoc in SparkR GitHub Action Job
> ---
>
> Key: SPARK-45407
> URL: https://issues.apache.org/jira/browse/SPARK-45407
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job

2023-10-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45407.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43208
[https://github.com/apache/spark/pull/43208]

> Skip Unidoc in SparkR GitHub Action Job
> ---
>
> Key: SPARK-45407
> URL: https://issues.apache.org/jira/browse/SPARK-45407
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-45406) Delete schema from DataFrame constructor

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45406.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43206
[https://github.com/apache/spark/pull/43206]

> Delete schema from DataFrame constructor
> 
>
> Key: SPARK-45406
> URL: https://issues.apache.org/jira/browse/SPARK-45406
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-45406) Delete schema from DataFrame constructor

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45406:


Assignee: Ruifeng Zheng

> Delete schema from DataFrame constructor
> 
>
> Key: SPARK-45406
> URL: https://issues.apache.org/jira/browse/SPARK-45406
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45409) Pin `torch<=2.0.1`

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45409:
---
Labels: pull-request-available  (was: )

> Pin `torch<=2.0.1`
> --
>
> Key: SPARK-45409
> URL: https://issues.apache.org/jira/browse/SPARK-45409
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None

2023-10-03 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771698#comment-17771698
 ] 

Gera Shegalov commented on SPARK-43389:
---

There is a symmetrical issue on the DataFrameWriter side:
{code:python}
>>> spark.createDataFrame([('some value',),]).write.option('someOpt', None).saveAsTable("hive_csv_t21")
{code}
 
{code:java}
23/10/03 21:39:12 WARN HiveExternalCatalog: Could not persist 
`spark_catalog`.`default`.`hive_csv_t21` in a Hive compatible way. Persisting 
it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.NullPointerException: Null values not allowed 
in persistent maps.)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
    at 
org.apache.spark.sql.hive.client.Shim_v0_12.createTable(HiveShim.scala:614)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:573)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:571)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:526)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:415)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
    at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
    at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:402)
    at 
org.apache.spark.sql.rapids.shims.GpuCreateDataSourceTableAsSelectCommand.run(GpuCreateDataSourceTableAsSelectCommandShims.scala:91)
    at 
com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult$lzycompute(GpuExecutedCommandExec.scala:52)
    at 
com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult(GpuExecutedCommandExec.scala:50)
    at 
com.nvidia.spark.rapids.GpuExecutedCommandExec.executeCollect(GpuExecutedCommandExec.scala:61)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
    at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
    at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
    at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
    at 
org.apache.spark
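
For context, a minimal sketch of the reader-side repro this issue describes (the session and file path are hypothetical; the point is the None-valued lineSep option):

{code:python}
# Sketch of SPARK-43389: passing None as the lineSep option reaches the CSV
# reader as a null value and triggers the NullPointerException. The path is
# a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("lineSep", None).csv("/tmp/example.csv")
{code}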

[jira] [Created] (SPARK-45409) Pin `torch<=2.0.1`

2023-10-03 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45409:
-

 Summary: Pin `torch<=2.0.1`
 Key: SPARK-45409
 URL: https://issues.apache.org/jira/browse/SPARK-45409
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-45408) [CORE] Add RPC SSL settings to TransportConf

2023-10-03 Thread Hasnain Lakhani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hasnain Lakhani updated SPARK-45408:

Summary: [CORE] Add RPC SSL settings to TransportConf  (was: Add RPC SSL 
settings to TransportConf)

> [CORE] Add RPC SSL settings to TransportConf
> 
>
> Key: SPARK-45408
> URL: https://issues.apache.org/jira/browse/SPARK-45408
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Priority: Major
>
> Add support for SSL RPC settings in TransportConf, plus associated tests 
> and sample configs used by other tests.






[jira] [Created] (SPARK-45408) Add RPC SSL settings to TransportConf

2023-10-03 Thread Hasnain Lakhani (Jira)
Hasnain Lakhani created SPARK-45408:
---

 Summary: Add RPC SSL settings to TransportConf
 Key: SPARK-45408
 URL: https://issues.apache.org/jira/browse/SPARK-45408
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Hasnain Lakhani


Add support for SSL RPC settings in TransportConf, plus associated tests and 
sample configs used by other tests.
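
A sketch of how such settings might look to users, assuming the keys follow the existing `spark.ssl.*` namespace (the `spark.ssl.rpc.*` names, keystore path, and password below are assumptions, not confirmed by this ticket):

{code:python}
# Hypothetical sketch: enabling SSL for RPC via SparkConf. The key names
# follow the spark.ssl.* convention and are assumptions here; the keystore
# path and password are placeholders.
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.ssl.rpc.enabled", "true")
    .set("spark.ssl.rpc.keyStore", "/path/to/keystore.jks")
    .set("spark.ssl.rpc.keyStorePassword", "changeit")
)
{code}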






[jira] [Updated] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45407:
---
Labels: pull-request-available  (was: )

> Skip Unidoc in SparkR GitHub Action Job
> ---
>
> Key: SPARK-45407
> URL: https://issues.apache.org/jira/browse/SPARK-45407
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job

2023-10-03 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45407:
-

 Summary: Skip Unidoc in SparkR GitHub Action Job
 Key: SPARK-45407
 URL: https://issues.apache.org/jira/browse/SPARK-45407
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Resolved] (SPARK-43620) Support `Column` for SparkConnectColumn.__getitem__

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43620.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43120
[https://github.com/apache/spark/pull/43120]

> Support `Column` for SparkConnectColumn.__getitem__
> ---
>
> Key: SPARK-43620
> URL: https://issues.apache.org/jira/browse/SPARK-43620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Repro:
> {code:python}
> pser = pd.Series(["a", "b", "c"])
> psser = ps.from_pandas(pser)
> psser.astype("category")  # internally calls `map_scol[self.spark.column]`
> {code}
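
A sketch of the Column-as-index behavior this enables (assuming an active session `spark`; the map column and key are illustrative):

{code:python}
# Indexing one Column with another Column (a map lookup here); previously
# SparkConnectColumn.__getitem__ did not accept a Column argument.
from pyspark.sql import functions as F

df = spark.createDataFrame([({"a": 1}, "a")], "m map<string,int>, k string")
df.select(df.m[F.col("k")]).show()
{code}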






[jira] [Assigned] (SPARK-43620) Support `Column` for SparkConnectColumn.__getitem__

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43620:


Assignee: Haejoon Lee

> Support `Column` for SparkConnectColumn.__getitem__
> ---
>
> Key: SPARK-43620
> URL: https://issues.apache.org/jira/browse/SPARK-43620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Repro:
> {code:python}
> pser = pd.Series(["a", "b", "c"])
> psser = ps.from_pandas(pser)
> psser.astype("category")  # internally calls `map_scol[self.spark.column]`
> {code}






[jira] [Resolved] (SPARK-45351) Change RocksDB as default shuffle service db backend

2023-10-03 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45351.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43142
[https://github.com/apache/spark/pull/43142]

> Change RocksDB as default shuffle service db backend
> 
>
> Key: SPARK-45351
> URL: https://issues.apache.org/jira/browse/SPARK-45351
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Change the default shuffle service DB backend to RocksDB, because we will 
> remove LevelDB in the future.
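
For reference, the backend is selected with the `spark.shuffle.service.db.backend` setting; a minimal sketch of pinning it explicitly while both backends exist:

{code:python}
# Sketch: choosing the shuffle service DB backend. With this change ROCKSDB
# becomes the default; LEVELDB remains selectable until it is removed.
from pyspark import SparkConf

conf = SparkConf().set("spark.shuffle.service.db.backend", "ROCKSDB")
{code}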






[jira] [Assigned] (SPARK-45351) Change RocksDB as default shuffle service db backend

2023-10-03 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-45351:


Assignee: Jia Fan

> Change RocksDB as default shuffle service db backend
> 
>
> Key: SPARK-45351
> URL: https://issues.apache.org/jira/browse/SPARK-45351
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
>
> Change the default shuffle service DB backend to RocksDB, because we will 
> remove LevelDB in the future.






[jira] [Updated] (SPARK-45406) Delete schema from DataFrame constructor

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45406:
---
Labels: pull-request-available  (was: )

> Delete schema from DataFrame constructor
> 
>
> Key: SPARK-45406
> URL: https://issues.apache.org/jira/browse/SPARK-45406
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45406) Delete schema from DataFrame constructor

2023-10-03 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45406:
-

 Summary: Delete schema from DataFrame constructor
 Key: SPARK-45406
 URL: https://issues.apache.org/jira/browse/SPARK-45406
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Created] (SPARK-45405) Refactor Python UDTF execution

2023-10-03 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-45405:
-

 Summary: Refactor Python UDTF execution
 Key: SPARK-45405
 URL: https://issues.apache.org/jira/browse/SPARK-45405
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin









[jira] [Resolved] (SPARK-45283) Make StatusTrackerSuite less fragile

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45283.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43194
[https://github.com/apache/spark/pull/43194]

> Make StatusTrackerSuite less fragile
> 
>
> Key: SPARK-45283
> URL: https://issues.apache.org/jira/browse/SPARK-45283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It was discovered from [GitHub 
> Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767]
>  that StatusTrackerSuite can run into random failures, as shown by the 
> following stack trace (highlighted in red). The proposed fix is to update 
> the unit test to remove the nondeterministic behavior.
> {quote}[info] StatusTrackerSuite:
> [info] - basic status API usage (99 milliseconds)
> [info] - getJobIdsForGroup() (56 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 
> milliseconds)
> [info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
> {color:#ff0000}[info] The code passed to eventually never returned normally. 
> Attempted 651 times over 10.00505994401 seconds. Last failure message: 
> Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
> [info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info] at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
> [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
> [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
> [info] at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
> [info] at 
> org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
> [info] at 
> org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
> [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info] at 
> org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
> [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
> [info] at 
> org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info] at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info] at scala.collection.immutable.List.foreach(List.scala:333)
> [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info] at org.scalatest.Suite.ru

[jira] [Assigned] (SPARK-45283) Make StatusTrackerSuite less fragile

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45283:
---

Assignee: Bo Xiong

> Make StatusTrackerSuite less fragile
> 
>
> Key: SPARK-45283
> URL: https://issues.apache.org/jira/browse/SPARK-45283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It was discovered from [GitHub 
> Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767]
>  that StatusTrackerSuite can run into random failures, as shown by the 
> following stack trace (highlighted in red). The proposed fix is to update 
> the unit test to remove the nondeterministic behavior.
> {quote}[info] StatusTrackerSuite:
> [info] - basic status API usage (99 milliseconds)
> [info] - getJobIdsForGroup() (56 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 
> milliseconds)
> [info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
> {color:#ff0000}[info] The code passed to eventually never returned normally. 
> Attempted 651 times over 10.00505994401 seconds. Last failure message: 
> Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
> [info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info] at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
> [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
> [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
> [info] at 
> org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
> [info] at 
> org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
> [info] at 
> org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
> [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info] at 
> org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
> [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
> [info] at 
> org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info] at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info] at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info] at scala.collection.immutable.List.foreach(List.scala:333)
> [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info] at org.scalatest.Suite.run(Suite.scala:1114)
> [info] at org.scalatest.Suite.run$(Suite.scala:1096)
> [info] at 
> org.scalatest.funsuite.AnyFunSuite.org$scalate

[jira] [Assigned] (SPARK-45376) [CORE] Add netty-tcnative-boringssl-static dependency

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45376:
---

Assignee: Hasnain Lakhani

> [CORE] Add netty-tcnative-boringssl-static dependency
> -
>
> Key: SPARK-45376
> URL: https://issues.apache.org/jira/browse/SPARK-45376
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add the boringssl dependency, which is needed for SSL functionality to work, 
> and provide the network-common test helper to other test modules that need 
> to test SSL functionality.






[jira] [Resolved] (SPARK-45376) [CORE] Add netty-tcnative-boringssl-static dependency

2023-10-03 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45376.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43164
[https://github.com/apache/spark/pull/43164]

> [CORE] Add netty-tcnative-boringssl-static dependency
> -
>
> Key: SPARK-45376
> URL: https://issues.apache.org/jira/browse/SPARK-45376
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the boringssl dependency, which is needed for SSL functionality to work, 
> and provide the network-common test helper to other test modules that need 
> to test SSL functionality.






[jira] [Assigned] (SPARK-45399) XML: Add XML Options using newOption

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45399:


Assignee: Sandip Agarwala

> XML: Add XML Options using newOption
> 
>
> Key: SPARK-45399
> URL: https://issues.apache.org/jira/browse/SPARK-45399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-45399) XML: Add XML Options using newOption

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45399.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43201
[https://github.com/apache/spark/pull/43201]

> XML: Add XML Options using newOption
> 
>
> Key: SPARK-45399
> URL: https://issues.apache.org/jira/browse/SPARK-45399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
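
For reference, a sketch of how options added this way surface to users of the built-in XML source (the `rowTag` option and the path are illustrative; an active session `spark` is assumed):

{code:python}
# Reading XML with a data source option; rowTag selects the XML element that
# maps to a row. The path is a placeholder.
df = spark.read.format("xml").option("rowTag", "book").load("/tmp/books.xml")
{code}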







[jira] [Resolved] (SPARK-45347) Include SparkThrowable in FetchErrorDetailsResponse

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45347.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43136
[https://github.com/apache/spark/pull/43136]

> Include SparkThrowable in FetchErrorDetailsResponse
> ---
>
> Key: SPARK-45347
> URL: https://issues.apache.org/jira/browse/SPARK-45347
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-45347) Include SparkThrowable in FetchErrorDetailsResponse

2023-10-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45347:


Assignee: Yihong He

> Include SparkThrowable in FetchErrorDetailsResponse
> ---
>
> Key: SPARK-45347
> URL: https://issues.apache.org/jira/browse/SPARK-45347
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45404) Support AWS_ENDPOINT_URL env variable

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45404:
---
Labels: pull-request-available  (was: )

> Support AWS_ENDPOINT_URL env variable
> -
>
> Key: SPARK-45404
> URL: https://issues.apache.org/jira/browse/SPARK-45404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45404) Support AWS_ENDPOINT_URL env variable

2023-10-03 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45404:
-

 Summary: Support AWS_ENDPOINT_URL env variable
 Key: SPARK-45404
 URL: https://issues.apache.org/jira/browse/SPARK-45404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun
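
A hypothetical illustration of the environment variable in use (the value is a placeholder for an S3-compatible endpoint such as a local test service):

{code:python}
# Sketch only: setting AWS_ENDPOINT_URL in the environment before Spark
# starts, so that S3 clients can pick up the custom endpoint.
import os

os.environ["AWS_ENDPOINT_URL"] = "http://localhost:4566"
{code}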









[jira] [Updated] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables

2023-10-03 Thread Reece Robinson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reece Robinson updated SPARK-45403:
---
Description: 
When using Spark SQL with the Hive JDBC driver to access a Hive table, every 
row of the resulting DataFrame contains the literal column names instead of 
the table's data.

When I run this:

{code:python}
jdbcDF = spark.read \
    .format("jdbc") \
    .options(driver="org.apache.hive.jdbc.HiveDriver",
             url="jdbc:hive2://10.20.174.171:10009",
             user="10009",
             password="123",
             query="select * from demo.hospitals limit 10") \
    .load()
{code}

 

I get:

{code}
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
(the header-valued row repeats for all 10 returned rows; table borders omitted)
{code}

I should see:

{code}
| person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
|001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
|002673d4-579a-4d1...|01 - American Ind...| M| 64.0| null| 2|a3c89a7f-d57d-4be...|2f6ffa09-e5b3-419...|7dbfc730-64bc-4a9...| |unknown| null| null| true|
(further rows with real data values follow; truncated in the original message)
{code}

[jira] [Updated] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables

2023-10-03 Thread Reece Robinson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reece Robinson updated SPARK-45403:
---
Description: 
When using Spark SQL with the Hive JDBC driver to access a Hive table, every 
row of the resulting DataFrame contains the literal column names instead of 
the table's data.

When I run this:

{code:python}
jdbcDF = spark.read \
    .format("jdbc") \
    .options(driver="org.apache.hive.jdbc.HiveDriver",
             url="jdbc:hive2://10.20.174.171:10009",
             user="10009",
             password="123",
             query="select * from demo.hospitals limit 10") \
    .load()
{code}

 

I get:

{code}
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
(the header-valued row repeats for all 10 returned rows; table borders omitted)
{code}

I should see:

{code}
| person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
|001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
|002673d4-579a-4d1...|01 - American Ind...| M| 64.0| null| 2|a3c89a7f-d57d-4be...|2f6ffa09-e5b3-419...|7dbfc730-64bc-4a9...| |unknown| null| null| true|
(further rows with real data values follow; truncated in the original message)
{code}

[jira] [Updated] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables

2023-10-03 Thread Reece Robinson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reece Robinson updated SPARK-45403:
---
Attachment: Screenshot 2023-10-04 at 11.11.28 AM.png

> Spark SQL returns table column names as literal data values for Hive tables
> ---
>
> Key: SPARK-45403
> URL: https://issues.apache.org/jira/browse/SPARK-45403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: I am using Spark 3.4.0; however, this has been an issue 
> for years.
>Reporter: Reece Robinson
>Priority: Major
> Attachments: Screenshot 2023-10-04 at 11.11.28 AM.png
>
>
> When using Spark SQL with the Hive JDBC driver to access a Hive table, every 
> row of the resulting DataFrame contains the literal column names instead of 
> the table's data.
> When I run this:
> {code:python}
> jdbcDF = spark.read \
>     .format("jdbc") \
>     .options(driver="org.apache.hive.jdbc.HiveDriver",
>              url="jdbc:hive2://10.20.174.171:10009",
>              user="10009",
>              password="123",
>              query="select * from demo.hospitals limit 10") \
>     .load()
> {code}
> I get:
> {code}
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> (the header-valued row repeats for all 10 returned rows; table borders omitted)
> {code}
> I should see:
> {code}
> | person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
> |001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
> (further rows with real data values follow; truncated in the original message)
> {code}

[jira] [Created] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables

2023-10-03 Thread Reece Robinson (Jira)
Reece Robinson created SPARK-45403:
--

 Summary: Spark SQL returns table column names as literal data 
values for Hive tables
 Key: SPARK-45403
 URL: https://issues.apache.org/jira/browse/SPARK-45403
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
 Environment: I am using Spark 3.4.0; however, this has been an issue for 
years.
Reporter: Reece Robinson


When using Spark SQL with the Hive JDBC driver to access a Hive table, every 
row of the resulting DataFrame contains the literal column names instead of 
the table's data.

When I run this:

{code:python}
jdbcDF = spark.read \
    .format("jdbc") \
    .options(driver="org.apache.hive.jdbc.HiveDriver",
             url="jdbc:hive2://10.20.174.171:10009",
             user="10009",
             password="123",
             query="select * from demo.hospitals limit 10") \
    .load()
{code}

 

I get:

{code}
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
(the header-valued row repeats for all 10 returned rows; table borders omitted)
{code}

I should see:

{code}
| person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
|001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
|002673d4-579a-4d1...|01 - American Ind...| M| 64.0| null| 2|a3c89a7f-d57d-4be...|2f6ffa09-e5b3-419...|7dbfc730-64bc-4a9...| |unknown| null| null| true|
(further rows with real data values follow; truncated in the original message)
{code}

[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table

2023-10-03 Thread Reece Robinson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reece Robinson updated SPARK-27943:
---
Attachment: (was: Screenshot 2023-10-04 at 11.11.28 AM.png)

> Implement default constraint with Column for Hive table
> ---
>
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiaan Geng
>Priority: Major
>
>  
>  *Background*
> A default constraint on a column is part of the ANSI standard.
> Hive 3.0+ has supported default constraints, see 
> https://issues.apache.org/jira/browse/HIVE-18726.
> But Spark SQL does not implement this feature yet.
> *Design*
> Hive is widely used in production environments and is the de facto standard 
> in the big data field.
> But many Hive versions are used in production, and features differ between 
> versions.
> Spark SQL needs to implement default constraints, and there are three points 
> to pay attention to in the design:
> _First_, Spark SQL should reduce coupling with Hive.
> _Second_, default constraints should be compatible with different versions 
> of Hive.
> _Third_, which expressions of a default constraint should Spark SQL support? 
> I think it should support `literal`, `current_date()`, `current_timestamp()`. 
> Maybe other expressions should also be supported, like `Cast(1 as float)`, 
> `1 + 2` and so on.
> We want to save the metadata of the default constraint into the properties 
> of the Hive table, and then restore the metadata from the properties after 
> the client gets the newest metadata. The implementation is the same as for 
> other metadata (e.g. partition, bucket, statistics).
> Because the default constraint is part of a column, I think we could reuse 
> the metadata of StructField. The default constraint will be cached in the 
> metadata of StructField.
>  
> *Tasks*
> This is a big piece of work, so I want to split it into sub-tasks, as 
> follows:
>  
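
For illustration, the ANSI-style DDL this feature targets, issued through a session (a hypothetical sketch; Spark does not support this syntax at the time of the issue):

{code:python}
# Hypothetical: ANSI default-constraint DDL of the kind this issue proposes.
# This does not run on Spark versions that predate the feature.
spark.sql("""
    CREATE TABLE employees (
        id INT,
        hired DATE DEFAULT current_date()
    )
""")
{code}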






[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table

2023-10-03 Thread Reece Robinson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reece Robinson updated SPARK-27943:
---
Attachment: Screenshot 2023-10-04 at 11.11.28 AM.png

> Implement default constraint with Column for Hive table
> ---
>
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiaan Geng
>Priority: Major
> Attachments: Screenshot 2023-10-04 at 11.11.28 AM.png
>
>
>  
>  *Background*
> Default constraints on columns are part of the ANSI standard.
> Hive 3.0+ supports default constraints 
> (ref: https://issues.apache.org/jira/browse/HIVE-18726),
> but Spark SQL does not implement this feature yet.
> *Design*
> Hive is widely used in production environments and is the de facto standard in 
> the big data field.
> However, many different Hive versions are used in production, and features 
> differ between versions.
> Spark SQL needs to implement default constraints, and there are three points to 
> pay attention to in the design:
> _First_, Spark SQL should reduce coupling with Hive.
> _Second_, default constraints should be compatible with different versions of 
> Hive.
> _Third_, which expressions should Spark SQL support in default constraints? I 
> think it should support `literal`, `current_date()`, and `current_timestamp()`. 
> Maybe other expressions should also be supported, like `Cast(1 as float)`, `1 + 
> 2`, and so on.
> We want to save the metadata of the default constraint into the properties of 
> the Hive table, and then restore the metadata from those properties after the 
> client gets the newest metadata. The implementation is the same as for other 
> metadata (e.g. partitions, buckets, statistics).
> Because a default constraint is part of a column, I think we could reuse the 
> metadata of StructField. The default constraint will be cached in the metadata 
> of StructField.
>  
> *Tasks*
> This is a big piece of work, so I want to split it into sub-tasks, as 
> follows:
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45402) Add API for 'analyze' method to return a buffer to be consumed on each class creation

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45402:
---
Labels: pull-request-available  (was: )

> Add API for 'analyze' method to return a buffer to be consumed on each class 
> creation
> -
>
> Key: SPARK-45402
> URL: https://issues.apache.org/jira/browse/SPARK-45402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Daniel
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45402) Add API for 'analyze' method to return a buffer to be consumed on each class creation

2023-10-03 Thread Daniel (Jira)
Daniel created SPARK-45402:
--

 Summary: Add API for 'analyze' method to return a buffer to be 
consumed on each class creation
 Key: SPARK-45402
 URL: https://issues.apache.org/jira/browse/SPARK-45402
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45400) Refer to the unescaping rules from expression descriptions

2023-10-03 Thread Max Gekk (Jira)
Max Gekk created SPARK-45400:


 Summary: Refer to the unescaping rules from expression descriptions
 Key: SPARK-45400
 URL: https://issues.apache.org/jira/browse/SPARK-45400
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk


Update the expression/function descriptions and refer to the unescaping rules in 
the items where regexp parameters are described. This should make them less 
confusing for users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44219) Add extra per-rule validation for optimization rewrites.

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44219:
---
Labels: pull-request-available  (was: )

> Add extra per-rule validation for optimization rewrites.
> 
>
> Key: SPARK-44219
> URL: https://issues.apache.org/jira/browse/SPARK-44219
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Yannis Sismanis
>Priority: Major
>  Labels: pull-request-available
>
> Adds per-rule validation checks for the following:
> 1. Aggregate expressions in Aggregate plans are valid.
> 2. Grouping key types in Aggregate plans cannot be of type Map.
> 3. No dangling references have been generated.
> This validation is enabled by default for all tests, or selectively via the 
> spark.sql.planChangeValidation=true flag, as in the sketch below.
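> A minimal sketch of enabling the validation outside the test suite (the flag 
> name is taken from this description; the rest is standard Spark session setup):
> {code:java}
> import org.apache.spark.sql.SparkSession
>
> // opt in to per-rule plan-change validation for this session
> val spark = SparkSession.builder()
>   .config("spark.sql.planChangeValidation", "true")
>   .getOrCreate()
> {code}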



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45136) Improve ClosureCleaner to support closures defined in Ammonite REPL

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45136:
---
Labels: pull-request-available  (was: )

> Improve ClosureCleaner to support closures defined in Ammonite REPL
> ---
>
> Key: SPARK-45136
> URL: https://issues.apache.org/jira/browse/SPARK-45136
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Vsevolod Stepanov
>Priority: Major
>  Labels: pull-request-available
>
> ConnectRepl uses the Ammonite REPL with CodeClassWrapper to run Scala code. 
> This means that each code cell is wrapped in a separate object. If multiple 
> variables are defined in the same cell / code block, closures will capture 
> extra variables, increasing the serialized UDF payload size or making it 
> non-serializable.
> For example, this code
> {code:java}
> // cell 1 
> {
>   val x = 100
>   val y = new NonSerializable
> }
> // cell 2
> spark.range(10).map(i => i + x).agg(sum("value")).collect(){code}
> will fail, because the lambda captures both `x` and `y` as they're defined in 
> the same wrapper object.
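> A hedged workaround sketch under the same assumption (each cell becomes its own 
> wrapper object, so a lambda only captures the wrappers it references):
> {code:java}
> // cell 1: keep the serializable value in its own cell
> val x = 100
> // cell 2: the non-serializable helper lives in a separate wrapper
> val y = new NonSerializable
> // cell 3: the lambda now only drags in cell 1's wrapper
> spark.range(10).map(i => i + x).agg(sum("value")).collect()
> {code}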



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45399) XML: Add XML Options using newOption

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45399:
---
Labels: pull-request-available  (was: )

> XML: Add XML Options using newOption
> 
>
> Key: SPARK-45399
> URL: https://issues.apache.org/jira/browse/SPARK-45399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45399) XML: Add XML Options using newOption

2023-10-03 Thread Sandip Agarwala (Jira)
Sandip Agarwala created SPARK-45399:
---

 Summary: XML: Add XML Options using newOption
 Key: SPARK-45399
 URL: https://issues.apache.org/jira/browse/SPARK-45399
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Sandip Agarwala






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45398) Include `ESCAPE` to `sql()` of `Like`

2023-10-03 Thread Max Gekk (Jira)
Max Gekk created SPARK-45398:


 Summary: Include `ESCAPE` to `sql()` of `Like`
 Key: SPARK-45398
 URL: https://issues.apache.org/jira/browse/SPARK-45398
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk


Fix the `sql()` method of the `Like` expression to append the `ESCAPE` clause. 
That makes it consistent with `toString` and fixes the issue:

{code:sql}
spark-sql (default)> create temp view tbl as (SELECT 'a|_' like 'a||_' escape 
'|', 'a|_' like 'a||_' escape 'a');
[COLUMN_ALREADY_EXISTS] The column `a|_ like a||_` already exists. Consider to 
choose another name or rename the existing column.
{code}
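If the fix appends the escape character to the `sql()` output (e.g. something 
like `a|_ LIKE a||_ ESCAPE '|'`), the two view columns would get distinct names 
and the CREATE above would succeed.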




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37467) Consolidate whole stage and non-whole stage subexpression elimination

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-37467:
---
Labels: pull-request-available  (was: )

> Consolidate whole stage and non-whole stage subexpression elimination
> -
>
> Key: SPARK-37467
> URL: https://issues.apache.org/jira/browse/SPARK-37467
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Adam Binford
>Priority: Major
>  Labels: pull-request-available
>
> Currently there are separate subexpression elimination paths for whole stage 
> and non-whole stage codegen. Consolidating these into a single code path 
> would make it simpler to add further enhancements, such as supporting lambda 
> functions  (https://issues.apache.org/jira/browse/SPARK-37466).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44848) MLlib GBTClassifier has wrong impurity method 'variance' instead of 'gini' or 'entropy'.

2023-10-03 Thread Oumar Nour (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771433#comment-17771433
 ] 

Oumar Nour commented on SPARK-44848:


Hello,

I have the same issue. Has it been resolved?

Thanks

> MLlib GBTClassifier has wrong impurity method 'variance' instead of 'gini' or 
> 'entropy'. 
> -
>
> Key: SPARK-44848
> URL: https://issues.apache.org/jira/browse/SPARK-44848
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.4.1
>Reporter: Elisabeth Niederbacher
>Priority: Major
>
> The impurity method 'variance' should only be used for regressors, *not* 
> classifiers. For classifiers, 'gini' and 'entropy' should be available, as is 
> already the case for the RandomForestClassifier 
> (https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html).
> Because of this bug, the 'minInfoGain' hyperparameter cannot be tuned to combat 
> overfitting.
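> A hedged repro sketch (assuming, per this report, that the `impurity` param on 
> GBTClassifier only accepts 'variance'):
> {code:java}
> import org.apache.spark.ml.classification.GBTClassifier
>
> val gbt = new GBTClassifier()
> println(gbt.getImpurity)   // prints "variance", the regression impurity
> gbt.setImpurity("gini")    // expected to fail validation: only "variance" is allowed
> {code}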



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to d

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-34591:
---
Labels: pull-request-available pyspark  (was: pyspark)

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pull-request-available, pyspark
> Attachments: Reproducible example of Spark bug - no 2.pdf, 
> Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the behaviour above:
>  * Always occurring, even when the user may not want it to
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us to work out why we were seeing 
> certain behaviours; we had to trace back through all of the detailed Spark 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because automatic 
> pruning is not the generally understood approach to building decision trees, 
> where pruning is a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-45397) Add vector assembler feature transformer

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45397:
---
Labels: pull-request-available  (was: )

> Add vector assembler feature transformer
> 
>
> Key: SPARK-45397
> URL: https://issues.apache.org/jira/browse/SPARK-45397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.1
>Reporter: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
>
> Add vector assembler feature transformer



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45355:
--

Assignee: (was: Apache Spark)

> Fix function groups in Scala Doc
> 
>
> Key: SPARK-45355
> URL: https://issues.apache.org/jira/browse/SPARK-45355
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45355:
--

Assignee: Apache Spark

> Fix function groups in Scala Doc
> 
>
> Key: SPARK-45355
> URL: https://issues.apache.org/jira/browse/SPARK-45355
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45355:
--

Assignee: Apache Spark

> Fix function groups in Scala Doc
> 
>
> Key: SPARK-45355
> URL: https://issues.apache.org/jira/browse/SPARK-45355
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45355:
--

Assignee: (was: Apache Spark)

> Fix function groups in Scala Doc
> 
>
> Key: SPARK-45355
> URL: https://issues.apache.org/jira/browse/SPARK-45355
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45355) Fix function groups in Scala Doc

2023-10-03 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45355:
---
Summary: Fix function groups in Scala Doc  (was: Re group functions in 
scala doc)

> Fix function groups in Scala Doc
> 
>
> Key: SPARK-45355
> URL: https://issues.apache.org/jira/browse/SPARK-45355
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45355) Re group functions in scala doc

2023-10-03 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45355:
---
Summary: Re group functions in scala doc  (was: Fix function groups in 
Scala Doc)

> Re group functions in scala doc
> ---
>
> Key: SPARK-45355
> URL: https://issues.apache.org/jira/browse/SPARK-45355
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45397) Add vector assembler feature transformer

2023-10-03 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-45397:
--

 Summary: Add vector assembler feature transformer
 Key: SPARK-45397
 URL: https://issues.apache.org/jira/browse/SPARK-45397
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML, PySpark
Affects Versions: 3.5.1
Reporter: Weichen Xu


Add vector assembler feature transformer



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org