[jira] [Assigned] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job
[ https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-45407:
-------------------------------------
Assignee: Dongjoon Hyun

> Skip Unidoc in SparkR GitHub Action Job
> Key: SPARK-45407
> URL: https://issues.apache.org/jira/browse/SPARK-45407
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available

[jira] [Resolved] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job
[ https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-45407.
-----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43208
[https://github.com/apache/spark/pull/43208]

> Skip Unidoc in SparkR GitHub Action Job
> Key: SPARK-45407
> URL: https://issues.apache.org/jira/browse/SPARK-45407
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0

[jira] [Resolved] (SPARK-45406) Delete schema from DataFrame constructor
[ https://issues.apache.org/jira/browse/SPARK-45406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-45406.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43206
[https://github.com/apache/spark/pull/43206]

> Delete schema from DataFrame constructor
> Key: SPARK-45406
> URL: https://issues.apache.org/jira/browse/SPARK-45406
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0

[jira] [Assigned] (SPARK-45406) Delete schema from DataFrame constructor
[ https://issues.apache.org/jira/browse/SPARK-45406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-45406:
------------------------------------
Assignee: Ruifeng Zheng

> Delete schema from DataFrame constructor
> Key: SPARK-45406
> URL: https://issues.apache.org/jira/browse/SPARK-45406
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available

[jira] [Updated] (SPARK-45409) Pin `torch<=2.0.1`
[ https://issues.apache.org/jira/browse/SPARK-45409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45409:
-----------------------------------
Labels: pull-request-available  (was: )

> Pin `torch<=2.0.1`
> Key: SPARK-45409
> URL: https://issues.apache.org/jira/browse/SPARK-45409
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available

[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None
[ https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771698#comment-17771698 ]

Gera Shegalov commented on SPARK-43389:
---------------------------------------

There is a symmetrical issue on the DataFrameWriter side:

{code:python}
>>> spark.createDataFrame([('some value',),]).write.option('someOpt', None).saveAsTable("hive_csv_t21")
{code}

{code:java}
23/10/03 21:39:12 WARN HiveExternalCatalog: Could not persist `spark_catalog`.`default`.`hive_csv_t21` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.NullPointerException: Null values not allowed in persistent maps.)
  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
  at org.apache.spark.sql.hive.client.Shim_v0_12.createTable(HiveShim.scala:614)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:573)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
  at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:571)
  at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:526)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:415)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:402)
  at org.apache.spark.sql.rapids.shims.GpuCreateDataSourceTableAsSelectCommand.run(GpuCreateDataSourceTableAsSelectCommandShims.scala:91)
  at com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult$lzycompute(GpuExecutedCommandExec.scala:52)
  at com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult(GpuExecutedCommandExec.scala:50)
  at com.nvidia.spark.rapids.GpuExecutedCommandExec.executeCollect(GpuExecutedCommandExec.scala:61)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
  at org.apache.spark
{code}

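A minimal defensive sketch for the writer-side symptom above, assuming the goal is simply to keep `None` values out of `DataFrameWriter.option()`; the option names are illustrative, not from the ticket:

{code:python}
# Hypothetical guard: drop None-valued options before they reach the writer,
# since Hive's persistent parameter maps reject nulls (see the
# NullPointerException above). Option names are illustrative only.
opts = {"someOpt": None, "sep": ","}

writer = spark.createDataFrame([("some value",)]).write
for key, value in opts.items():
    if value is not None:  # skip None instead of passing it through
        writer = writer.option(key, value)
writer.saveAsTable("hive_csv_t21_safe")
{code}
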
[jira] [Created] (SPARK-45409) Pin `torch<=2.0.1`
Dongjoon Hyun created SPARK-45409:
----------------------------------

Summary: Pin `torch<=2.0.1`
Key: SPARK-45409
URL: https://issues.apache.org/jira/browse/SPARK-45409
Project: Spark
Issue Type: Test
Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun

[jira] [Updated] (SPARK-45408) [CORE] Add RPC SSL settings to TransportConf
[ https://issues.apache.org/jira/browse/SPARK-45408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hasnain Lakhani updated SPARK-45408:
------------------------------------
Summary: [CORE] Add RPC SSL settings to TransportConf  (was: Add RPC SSL settings to TransportConf)

> [CORE] Add RPC SSL settings to TransportConf
> Key: SPARK-45408
> URL: https://issues.apache.org/jira/browse/SPARK-45408
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Hasnain Lakhani
> Priority: Major
>
> Add the SSL RPC settings to TransportConf, along with some associated tests and sample configs used by other tests.

[jira] [Created] (SPARK-45408) Add RPC SSL settings to TransportConf
Hasnain Lakhani created SPARK-45408:
------------------------------------

Summary: Add RPC SSL settings to TransportConf
Key: SPARK-45408
URL: https://issues.apache.org/jira/browse/SPARK-45408
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Hasnain Lakhani

Add the SSL RPC settings to TransportConf, along with some associated tests and sample configs used by other tests.

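The ticket does not list the new keys. As a sketch, Spark's SSL options follow a `spark.ssl.<namespace>.*` convention, so RPC-scoped settings would plausibly look like the following; the exact key names are an assumption, not confirmed by this ticket:

{code:python}
# Illustrative only: assumed spark.ssl.rpc.* key names following Spark's
# existing spark.ssl.<namespace>.* convention; paths and password are
# placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.ssl.rpc.enabled", "true")
    .config("spark.ssl.rpc.keyStore", "/path/to/keystore.jks")
    .config("spark.ssl.rpc.keyStorePassword", "password")
    .config("spark.ssl.rpc.trustStore", "/path/to/truststore.jks")
    .getOrCreate()
)
{code}
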
[jira] [Updated] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job
[ https://issues.apache.org/jira/browse/SPARK-45407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45407:
-----------------------------------
Labels: pull-request-available  (was: )

> Skip Unidoc in SparkR GitHub Action Job
> Key: SPARK-45407
> URL: https://issues.apache.org/jira/browse/SPARK-45407
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available

[jira] [Created] (SPARK-45407) Skip Unidoc in SparkR GitHub Action Job
Dongjoon Hyun created SPARK-45407:
----------------------------------

Summary: Skip Unidoc in SparkR GitHub Action Job
Key: SPARK-45407
URL: https://issues.apache.org/jira/browse/SPARK-45407
Project: Spark
Issue Type: Test
Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun

[jira] [Resolved] (SPARK-43620) Support `Column` for SparkConnectColumn.__getitem__
[ https://issues.apache.org/jira/browse/SPARK-43620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-43620.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43120
[https://github.com/apache/spark/pull/43120]

> Support `Column` for SparkConnectColumn.__getitem__
> Key: SPARK-43620
> URL: https://issues.apache.org/jira/browse/SPARK-43620
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Pandas API on Spark
> Affects Versions: 3.5.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Repro:
> {code:java}
> pser = pd.Series(["a", "b", "c"])
> psser = ps.from_pandas(pser)
> psser.astype("category")  # internally calls `map_scol[self.spark.column]`
> {code}

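For context, a self-contained version of the repro quoted above (imports added); on affected versions it failed under Spark Connect because `Column.__getitem__` did not accept a `Column` index:

{code:python}
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series(["a", "b", "c"])
psser = ps.from_pandas(pser)
# Internally evaluates map_scol[self.spark.column], i.e. Column.__getitem__
# with a Column argument -- the case this fix adds support for.
psser.astype("category")
{code}
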
[jira] [Assigned] (SPARK-43620) Support `Column` for SparkConnectColumn.__getitem__
[ https://issues.apache.org/jira/browse/SPARK-43620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-43620:
------------------------------------
Assignee: Haejoon Lee

> Support `Column` for SparkConnectColumn.__getitem__
> Key: SPARK-43620
> URL: https://issues.apache.org/jira/browse/SPARK-43620
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Pandas API on Spark
> Affects Versions: 3.5.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
>
> Repro:
> {code:java}
> pser = pd.Series(["a", "b", "c"])
> psser = ps.from_pandas(pser)
> psser.astype("category")  # internally calls `map_scol[self.spark.column]`
> {code}

[jira] [Resolved] (SPARK-45351) Change RocksDB as default shuffle service db backend
[ https://issues.apache.org/jira/browse/SPARK-45351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie resolved SPARK-45351.
------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43142
[https://github.com/apache/spark/pull/43142]

> Change RocksDB as default shuffle service db backend
> Key: SPARK-45351
> URL: https://issues.apache.org/jira/browse/SPARK-45351
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Jia Fan
> Assignee: Jia Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Make RocksDB the default shuffle service DB backend, because LevelDB will be removed in the future.

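Operators who would rather pin the backend explicitly than rely on the new default can use `spark.shuffle.service.db.backend`, which accepts LEVELDB or ROCKSDB; a sketch (this normally belongs in the configuration of whatever hosts the external shuffle service, e.g. the YARN NodeManager, and is shown on a SparkConf only for illustration):

{code:python}
from pyspark import SparkConf

# Pin the shuffle service state-store backend explicitly instead of
# relying on the default.
conf = SparkConf()
conf.set("spark.shuffle.service.enabled", "true")
conf.set("spark.shuffle.service.db.backend", "ROCKSDB")
{code}
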
[jira] [Assigned] (SPARK-45351) Change RocksDB as default shuffle service db backend
[ https://issues.apache.org/jira/browse/SPARK-45351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie reassigned SPARK-45351:
--------------------------------
Assignee: Jia Fan

> Change RocksDB as default shuffle service db backend
> Key: SPARK-45351
> URL: https://issues.apache.org/jira/browse/SPARK-45351
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Jia Fan
> Assignee: Jia Fan
> Priority: Major
> Labels: pull-request-available
>
> Make RocksDB the default shuffle service DB backend, because LevelDB will be removed in the future.

[jira] [Updated] (SPARK-45406) Delete schema from DataFrame constructor
[ https://issues.apache.org/jira/browse/SPARK-45406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45406:
-----------------------------------
Labels: pull-request-available  (was: )

> Delete schema from DataFrame constructor
> Key: SPARK-45406
> URL: https://issues.apache.org/jira/browse/SPARK-45406
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available

[jira] [Created] (SPARK-45406) Delete schema from DataFrame constructor
Ruifeng Zheng created SPARK-45406:
----------------------------------

Summary: Delete schema from DataFrame constructor
Key: SPARK-45406
URL: https://issues.apache.org/jira/browse/SPARK-45406
Project: Spark
Issue Type: Improvement
Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng

[jira] [Created] (SPARK-45405) Refactor Python UDTF execution
Takuya Ueshin created SPARK-45405:
----------------------------------

Summary: Refactor Python UDTF execution
Key: SPARK-45405
URL: https://issues.apache.org/jira/browse/SPARK-45405
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin

[jira] [Resolved] (SPARK-45283) Make StatusTrackerSuite less fragile
[ https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-45283.
-----------------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43194
[https://github.com/apache/spark/pull/43194]

> Make StatusTrackerSuite less fragile
> Key: SPARK-45283
> URL: https://issues.apache.org/jira/browse/SPARK-45283
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Tests
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Bo Xiong
> Assignee: Bo Xiong
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> It was discovered on [Github Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767] that StatusTrackerSuite can run into random failures, as shown by the stack trace below (failure highlighted in red). The proposed fix updates the unit test to remove the nondeterministic behavior.
> {quote}[info] StatusTrackerSuite:
> [info] - basic status API usage (99 milliseconds)
> [info] - getJobIdsForGroup() (56 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 milliseconds)
> [info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
> {color:#ff}[info] The code passed to eventually never returned normally. Attempted 651 times over 10.00505994401 seconds. Last failure message: Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
> [info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
> [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
> [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
> [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
> [info] at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
> [info] at org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
> [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info] at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
> [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
> [info] at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
> [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info] at scala.collection.immutable.List.foreach(List.scala:333)
> [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info] at org.scalatest.Suite.ru

[jira] [Assigned] (SPARK-45283) Make StatusTrackerSuite less fragile
[ https://issues.apache.org/jira/browse/SPARK-45283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan reassigned SPARK-45283:
-------------------------------------------
Assignee: Bo Xiong

> Make StatusTrackerSuite less fragile
> Key: SPARK-45283
> URL: https://issues.apache.org/jira/browse/SPARK-45283
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Tests
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Bo Xiong
> Assignee: Bo Xiong
> Priority: Minor
> Labels: pull-request-available
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> It was discovered on [Github Actions|https://github.com/xiongbo-sjtu/spark/actions/runs/6270601155/job/17028788767] that StatusTrackerSuite can run into random failures, as shown by the stack trace below (failure highlighted in red). The proposed fix updates the unit test to remove the nondeterministic behavior.
> {quote}[info] StatusTrackerSuite:
> [info] - basic status API usage (99 milliseconds)
> [info] - getJobIdsForGroup() (56 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() (48 milliseconds)
> [info] - getJobIdsForGroup() with takeAsync() across multiple partitions (58 milliseconds)
> [info] - getJobIdsForTag() *** FAILED *** (10 seconds, 77 milliseconds)
> {color:#ff}[info] The code passed to eventually never returned normally. Attempted 651 times over 10.00505994401 seconds. Last failure message: Set(3, 2, 1) was not equal to Set(1, 2). (StatusTrackerSuite.scala:148){color}
> [info] org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
> [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
> [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:348)
> [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:347)
> [info] at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
> [info] at org.apache.spark.StatusTrackerSuite.$anonfun$new$21(StatusTrackerSuite.scala:148)
> [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info] at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
> [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
> [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
> [info] at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
> [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info] at scala.collection.immutable.List.foreach(List.scala:333)
> [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info] at org.scalatest.Suite.run(Suite.scala:1114)
> [info] at org.scalatest.Suite.run$(Suite.scala:1096)
> [info] at org.scalatest.funsuite.AnyFunSuite.org$scalate

[jira] [Assigned] (SPARK-45376) [CORE] Add netty-tcnative-boringssl-static dependency
[ https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan reassigned SPARK-45376:
-------------------------------------------
Assignee: Hasnain Lakhani

> [CORE] Add netty-tcnative-boringssl-static dependency
> Key: SPARK-45376
> URL: https://issues.apache.org/jira/browse/SPARK-45376
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Hasnain Lakhani
> Assignee: Hasnain Lakhani
> Priority: Major
> Labels: pull-request-available
>
> Add the boringssl dependency, which is needed for SSL functionality to work, and provide the network-common test helper to other test modules that need to test SSL functionality.

[jira] [Resolved] (SPARK-45376) [CORE] Add netty-tcnative-boringssl-static dependency
[ https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-45376.
-----------------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43164
[https://github.com/apache/spark/pull/43164]

> [CORE] Add netty-tcnative-boringssl-static dependency
> Key: SPARK-45376
> URL: https://issues.apache.org/jira/browse/SPARK-45376
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Hasnain Lakhani
> Assignee: Hasnain Lakhani
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Add the boringssl dependency, which is needed for SSL functionality to work, and provide the network-common test helper to other test modules that need to test SSL functionality.

[jira] [Assigned] (SPARK-45399) XML: Add XML Options using newOption
[ https://issues.apache.org/jira/browse/SPARK-45399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-45399:
------------------------------------
Assignee: Sandip Agarwala

> XML: Add XML Options using newOption
> Key: SPARK-45399
> URL: https://issues.apache.org/jira/browse/SPARK-45399
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Sandip Agarwala
> Assignee: Sandip Agarwala
> Priority: Major
> Labels: pull-request-available

[jira] [Resolved] (SPARK-45399) XML: Add XML Options using newOption
[ https://issues.apache.org/jira/browse/SPARK-45399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-45399.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43201
[https://github.com/apache/spark/pull/43201]

> XML: Add XML Options using newOption
> Key: SPARK-45399
> URL: https://issues.apache.org/jira/browse/SPARK-45399
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Sandip Agarwala
> Assignee: Sandip Agarwala
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0

[jira] [Resolved] (SPARK-45347) Include SparkThrowable in FetchErrorDetailsResponse
[ https://issues.apache.org/jira/browse/SPARK-45347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-45347.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 43136
[https://github.com/apache/spark/pull/43136]

> Include SparkThrowable in FetchErrorDetailsResponse
> Key: SPARK-45347
> URL: https://issues.apache.org/jira/browse/SPARK-45347
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Yihong He
> Assignee: Yihong He
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0

[jira] [Assigned] (SPARK-45347) Include SparkThrowable in FetchErrorDetailsResponse
[ https://issues.apache.org/jira/browse/SPARK-45347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-45347:
------------------------------------
Assignee: Yihong He

> Include SparkThrowable in FetchErrorDetailsResponse
> Key: SPARK-45347
> URL: https://issues.apache.org/jira/browse/SPARK-45347
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Yihong He
> Assignee: Yihong He
> Priority: Major
> Labels: pull-request-available

[jira] [Updated] (SPARK-45404) Support AWS_ENDPOINT_URL env variable
[ https://issues.apache.org/jira/browse/SPARK-45404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45404:
-----------------------------------
Labels: pull-request-available  (was: )

> Support AWS_ENDPOINT_URL env variable
> Key: SPARK-45404
> URL: https://issues.apache.org/jira/browse/SPARK-45404
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available

[jira] [Created] (SPARK-45404) Support AWS_ENDPOINT_URL env variable
Dongjoon Hyun created SPARK-45404:
----------------------------------

Summary: Support AWS_ENDPOINT_URL env variable
Key: SPARK-45404
URL: https://issues.apache.org/jira/browse/SPARK-45404
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun

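The ticket does not spell out how the variable is consumed. As a hedged sketch, an endpoint override of this kind typically ends up in the S3A connector's `fs.s3a.endpoint` setting; the exact wiring done by SPARK-45404 is an assumption here:

{code:python}
# Assumption: AWS_ENDPOINT_URL maps to the S3A endpoint; the precedence
# rules Spark applies are not described in this ticket.
import os

endpoint = os.environ.get("AWS_ENDPOINT_URL")
if endpoint:
    spark.conf.set("spark.hadoop.fs.s3a.endpoint", endpoint)
{code}
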
[jira] [Updated] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables
[ https://issues.apache.org/jira/browse/SPARK-45403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reece Robinson updated SPARK-45403:
-----------------------------------
Description:

When using Spark SQL with the Hive JDBC driver to access a Hive table, every row of the resulting DataFrame contains the literal column names instead of the table data.

When I run this:

jdbcDF = spark.read \
    .format("jdbc") \
    .options(driver="org.apache.hive.jdbc.HiveDriver",
             url="jdbc:hive2://10.20.174.171:10009",
             user="10009",
             password="123",
             query="select * from demo.hospitals limit 10") \
    .load()

I get:

+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+

I should see:

+----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
| person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
+----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
|001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
|002673d4-579a-4d1...|01 - American Ind...| M| 64.0| null| 2|a3c89a7f-d57d-4be...|2f6ffa09-e5b3-419...|7dbfc730-64bc-4a9...| |unknown| null| null| true|
|00267822-8192-44f...|01 - American Ind...| F| 0.0| null| 2|cd318b72-35d4-422...|44646492-60ef-44e...|d5f462ef-cd4c-497...| |unknown| null| null| true|
|0028fece-59ec-4db...|01 - American Ind...| F| 0.0| null| 2|ee9e09aa-67bc-47e...|3be068de-7fe3-44d...|63a04010-c381-4aa...| |unknown| null| null| true|
|003470e7-b548-444...|06 - American Ind...| M| 171.0| null| 2|7ed5b0f9-02b3-459...|1b778c9f-71ab-45a...|84ecc23a-6c39-44d...| |unknown| null| null| false|
|0044a493-e226-409...|01 - American Ind...| F| 0.0| null| 2|c821f5b2-d0af-428...|26144dac-81f0-44e...|f7355eeb-89a3-4f0...| |unknown| null| null| true|
|004d44d0-fdf7-403...|01 - American Ind...| F| 37.0| null| 2|cb6c8e5c-71ab-409...|88eaf3c4-5f00-4e9...|78679644-f4e7-450...| |unknown| null| null| true|
|0059c1bf-5263-42a...|03 - Black or Afr...| M| 0.0| null| 2|da9247d1-96fb-44d...|6831544a-faf9-426...|3534f3a8-a367-41e...| |unknown| null| null| true|
|007b82b6-ae2e-49e...|01 - American Ind...| M| 43.0| null| 2|3e6fcc8c-c484-465...|90e2a03f-f0a4-48f...|5c9c71e1-901b-481...| |unknown| null| null| t

[jira] [Updated] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables
[ https://issues.apache.org/jira/browse/SPARK-45403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reece Robinson updated SPARK-45403:
-----------------------------------
Description:

When using Spark SQL with the Hive JDBC driver to access a Hive table, every row of the resulting DataFrame contains the literal column names instead of the table data.

When I run this:

jdbcDF = spark.read \
    .format("jdbc") \
    .options(driver="org.apache.hive.jdbc.HiveDriver",
             url="jdbc:hive2://10.20.174.171:10009",
             user="10009",
             password="123",
             query="select * from demo.hospitals limit 10") \
    .load()

I get:

+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+

I should see:

+----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
| person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
+----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
|001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
|002673d4-579a-4d1...|01 - American Ind...| M| 64.0| null| 2|a3c89a7f-d57d-4be...|2f6ffa09-e5b3-419...|7dbfc730-64bc-4a9...| |unknown| null| null| true|
|00267822-8192-44f...|01 - American Ind...| F| 0.0| null| 2|cd318b72-35d4-422...|44646492-60ef-44e...|d5f462ef-cd4c-497...| |unknown| null| null| true|
|0028fece-59ec-4db...|01 - American Ind...| F| 0.0| null| 2|ee9e09aa-67bc-47e...|3be068de-7fe3-44d...|63a04010-c381-4aa...| |unknown| null| null| true|
|003470e7-b548-444...|06 - American Ind...| M| 171.0| null| 2|7ed5b0f9-02b3-459...|1b778c9f-71ab-45a...|84ecc23a-6c39-44d...| |unknown| null| null| false|
|0044a493-e226-409...|01 - American Ind...| F| 0.0| null| 2|c821f5b2-d0af-428...|26144dac-81f0-44e...|f7355eeb-89a3-4f0...| |unknown| null| null| true|
|004d44d0-fdf7-403...|01 - American Ind...| F| 37.0| null| 2|cb6c8e5c-71ab-409...|88eaf3c4-5f00-4e9...|78679644-f4e7-450

[jira] [Updated] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables
[ https://issues.apache.org/jira/browse/SPARK-45403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reece Robinson updated SPARK-45403:
-----------------------------------
Attachment: Screenshot 2023-10-04 at 11.11.28 AM.png

> Spark SQL returns table column names as literal data values for Hive tables
> Key: SPARK-45403
> URL: https://issues.apache.org/jira/browse/SPARK-45403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Environment: I am using Spark 3.4.0, however this has been an issue for years.
> Reporter: Reece Robinson
> Priority: Major
> Attachments: Screenshot 2023-10-04 at 11.11.28 AM.png
>
> When using Spark SQL with the Hive JDBC driver to access a Hive table, every row of the resulting DataFrame contains the literal column names instead of the table data.
> When I run this:
>
> jdbcDF = spark.read \
>     .format("jdbc") \
>     .options(driver="org.apache.hive.jdbc.HiveDriver",
>              url="jdbc:hive2://10.20.174.171:10009",
>              user="10009",
>              password="123",
>              query="select * from demo.hospitals limit 10") \
>     .load()
>
> I get:
>
> +------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> +------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> |provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
> +------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
>
> I should see:
>
> +----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
> | person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
> +----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
> |001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
> |002673d4-579a-4d1...|01 - American Ind...| M| 64.0| null| 2|a3c89a7f-d57d-4be...|2f6ffa09-e5b3-419...|7dbfc730-64bc-4a9...| |unknown| null| null| true|
> |00267822-8192-44f...|01 - American Ind...| F| 0.0| null| 2|cd318b72-35d4-422...|44646492-60ef-44e...|d5f462ef-cd4c-497...| |unknown| null| null| true|
> |0028fece-59ec-4db...|01 - American Ind...| F| 0.0| null| 2|ee9e09aa-67bc-47e...|3be068de-7fe3-44d...|63a04010-c381-4aa...| |unknown| null| null| true|
> |003470e7-b548-444...|06 - American

[jira] [Created] (SPARK-45403) Spark SQL returns table column names as literal data values for Hive tables
Reece Robinson created SPARK-45403:
-----------------------------------

Summary: Spark SQL returns table column names as literal data values for Hive tables
Key: SPARK-45403
URL: https://issues.apache.org/jira/browse/SPARK-45403
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0
Environment: I am using Spark 3.4.0, however this has been an issue for years.
Reporter: Reece Robinson

When using Spark SQL with the Hive JDBC driver to access a Hive table, every row of the resulting DataFrame contains the literal column names instead of the table data.

When I run this:

jdbcDF = spark.read \
    .format("jdbc") \
    .options(driver="org.apache.hive.jdbc.HiveDriver",
             url="jdbc:hive2://10.20.174.171:10009",
             user="10009",
             password="123",
             query="select * from demo.hospitals limit 10") \
    .load()

I get:

+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
|provider_num|npi|name|address|city|state|zip|fips_county|lat|lon|phone|provider_type_code|category|emergency|upin|pin|region_code|bed_count|clia_lab_number|HIP_PK|
+------------+---+----+-------+----+-----+---+-----------+---+---+-----+------------------+--------+---------+----+---+-----------+---------+---------------+------+

I should see:

+----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
| person_pk| race_value|sex_code|poverty_value|veteran_value|ppr_pro| patient_pk| di_dk| pov_pk|vet_pk|veteran|total_paid|num_drugs|immunized|
+----------+-----------+--------+-------------+-------------+-------+-----------+------+-------+------+-------+----------+---------+---------+
|001252a7-a1e7-428...|01 - American Ind...| F| 37.0| null| 2|65007233-424e-4c2...|9d66f5b7-ab10-47f...|1f3d76c8-d039-483...| |unknown| null| null| true|
|002673d4-579a-4d1...|01 - American Ind...| M| 64.0| null| 2|a3c89a7f-d57d-4be...|2f6ffa09-e5b3-419...|7dbfc730-64bc-4a9...| |unknown| null| null| true|
|00267822-8192-44f...|01 - American Ind...| F| 0.0| null| 2|cd318b72-35d4-422...|44646492-60ef-44e...|d5f462ef-cd4c-497...| |unknown| null| null| true|
|0028fece-59ec-4db...|01 - American Ind...| F| 0.0| null| 2|ee9e09aa-67bc-47e...|3be068de-7fe3-44d...|63a04010-c381-4aa...| |unknown| null| null| true|
|003470e7-b548-444...|06 - American Ind...| M| 171.0| null| 2|7ed5b0f9-02b3-459...|1b778c9f-71ab-45a...|84ecc23a-6c39-44d...| |unknown| null| null| false|
|0044a493-e226-409...|01 - American Ind...| F| 0.0| null| 2|c821f5b2-d0af-428...|26144dac-81f0-44e...|f7355eeb-89a3-4f0...| |unknown| null| null| true|
|004d44d0-fdf7-403...|01 - American Ind...| F| 37.0| null| 2|cb6c8e5c-71ab-409...|88eaf3c4-5f00-4e9...|78679644-f4e7-450...| |unknown| null| null| true|
|0059c1bf-5263-42a...|03

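A common cause of this exact symptom is identifier quoting: Spark's generic JDBC dialect wraps column names in double quotes when it generates the projection, and HiveQL treats double-quoted tokens as string literals, so every row echoes the column name. That explanation is a hypothesis here, not a conclusion from the ticket; a small diagnostic sketch reusing the connection details from the report:

{code:python}
# If the quoting hypothesis holds, this returns the literal string 'name'
# for every row, reproducing the symptom without Spark's generated SQL.
probe = (
    spark.read.format("jdbc")
    .options(
        driver="org.apache.hive.jdbc.HiveDriver",
        url="jdbc:hive2://10.20.174.171:10009",
        user="10009",
        password="123",
        query='select "name" from demo.hospitals limit 3',
    )
    .load()
)
probe.show()
{code}
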
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reece Robinson updated SPARK-27943:
-----------------------------------
Attachment: (was: Screenshot 2023-10-04 at 11.11.28 AM.png)

> Implement default constraint with Column for Hive table
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Jiaan Geng
> Priority: Major
>
> *Background*
> Default constraint with column is an ANSI standard.
> Hive 3.0+ supports default constraints; ref: https://issues.apache.org/jira/browse/HIVE-18726
> But Spark SQL does not implement this feature yet.
>
> *Design*
> Hive is widely used in production environments and is the de facto standard in the field of big data.
> But many Hive versions are used in production, and features differ between versions.
> Spark SQL needs to implement default constraints, and there are three points to pay attention to in the design:
> _First_, Spark SQL should reduce coupling with Hive.
> _Second_, default constraints should be compatible with different versions of Hive.
> _Third_, which expressions should Spark SQL support in default constraints? I think it should support `literal`, `current_date()`, and `current_timestamp()`. Maybe other expressions should also be supported, like `Cast(1 as float)`, `1 + 2`, and so on.
> We want to save the metadata of the default constraint into the properties of the Hive table, and then restore the metadata from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics).
> Because a default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the metadata of StructField.
>
> *Tasks*
> This is a big piece of work, so I want to split it into some sub-tasks.

[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reece Robinson updated SPARK-27943:
-----------------------------------
Attachment: Screenshot 2023-10-04 at 11.11.28 AM.png

> Implement default constraint with Column for Hive table
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Jiaan Geng
> Priority: Major
> Attachments: Screenshot 2023-10-04 at 11.11.28 AM.png
>
> *Background*
> Default constraint with column is an ANSI standard.
> Hive 3.0+ supports default constraints; ref: https://issues.apache.org/jira/browse/HIVE-18726
> But Spark SQL does not implement this feature yet.
>
> *Design*
> Hive is widely used in production environments and is the de facto standard in the field of big data.
> But many Hive versions are used in production, and features differ between versions.
> Spark SQL needs to implement default constraints, and there are three points to pay attention to in the design:
> _First_, Spark SQL should reduce coupling with Hive.
> _Second_, default constraints should be compatible with different versions of Hive.
> _Third_, which expressions should Spark SQL support in default constraints? I think it should support `literal`, `current_date()`, and `current_timestamp()`. Maybe other expressions should also be supported, like `Cast(1 as float)`, `1 + 2`, and so on.
> We want to save the metadata of the default constraint into the properties of the Hive table, and then restore the metadata from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics).
> Because a default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the metadata of StructField.
>
> *Tasks*
> This is a big piece of work, so I want to split it into some sub-tasks.

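For reference, the ANSI-style DDL the ticket is aiming at, as supported by Hive 3.0+ per HIVE-18726; this is illustrative target syntax only, and on the Spark versions the ticket covers it is not expected to run:

{code:python}
# Illustrative target syntax only; DEFAULT constraints are what this ticket
# proposes, not what the affected Spark versions already support.
spark.sql("""
    CREATE TABLE events (
        id INT,
        created TIMESTAMP DEFAULT current_timestamp()
    )
""")
{code}
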
[jira] [Updated] (SPARK-45402) Add API for 'analyze' method to return a buffer to be consumed on each class creation
[ https://issues.apache.org/jira/browse/SPARK-45402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45402:
-----------------------------------
Labels: pull-request-available  (was: )

> Add API for 'analyze' method to return a buffer to be consumed on each class creation
> Key: SPARK-45402
> URL: https://issues.apache.org/jira/browse/SPARK-45402
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Daniel
> Priority: Major
> Labels: pull-request-available

[jira] [Created] (SPARK-45402) Add API for 'analyze' method to return a buffer to be consumed on each class creation
Daniel created SPARK-45402:
---------------------------

Summary: Add API for 'analyze' method to return a buffer to be consumed on each class creation
Key: SPARK-45402
URL: https://issues.apache.org/jira/browse/SPARK-45402
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Daniel

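For orientation, a minimal sketch of the existing Python UDTF `analyze` hook that this ticket extends; the new buffer-returning API itself is not shown, since its shape is not described here:

{code:python}
# Existing-API sketch based on the pyspark 3.5 UDTF interface: a static
# 'analyze' derives the output schema from the input argument's type.
from pyspark.sql.functions import udtf
from pyspark.sql.types import StructType
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

@udtf
class EchoUDTF:
    @staticmethod
    def analyze(arg: AnalyzeArgument) -> AnalyzeResult:
        # Output schema mirrors the input column's type.
        return AnalyzeResult(schema=StructType().add("out", arg.dataType))

    def eval(self, value):
        yield (value,)
{code}
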
[jira] [Created] (SPARK-45400) Refer to the unescaping rules from expression descriptions
Max Gekk created SPARK-45400: Summary: Refer to the unescaping rules from expression descriptions Key: SPARK-45400 URL: https://issues.apache.org/jira/browse/SPARK-45400 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 4.0.0 Reporter: Max Gekk Assignee: Max Gekk Update the expression/function descriptions and refer to the unescaping rules in the items where regexp parameters are described. This should make those parameters less confusing to users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44219) Add extra per-rule validation for optimization rewrites.
[ https://issues.apache.org/jira/browse/SPARK-44219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44219: --- Labels: pull-request-available (was: ) > Add extra per-rule validation for optimization rewrites. > > > Key: SPARK-44219 > URL: https://issues.apache.org/jira/browse/SPARK-44219 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.0, 3.4.1 >Reporter: Yannis Sismanis >Priority: Major > Labels: pull-request-available > > Adds per-rule validation checks for the following: > 1. Aggregate expressions in Aggregate plans are valid. > 2. Grouping key types in Aggregate plans cannot be of type Map. > 3. No dangling references have been generated. > This validation is enabled by default for all tests, or selectively via the > spark.sql.planChangeValidation=true flag. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
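For reference, a minimal sketch of turning the flag on for a local test session; the flag name is taken from the issue text above, and whether it can also be changed at runtime is not asserted here:

{code:scala}
import org.apache.spark.sql.SparkSession

// Enable the per-rule plan-change validation described above for a test session.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.planChangeValidation", "true")
  .getOrCreate()
{code}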
[jira] [Updated] (SPARK-45136) Improve ClosureCleaner to support closures defined in Ammonite REPL
[ https://issues.apache.org/jira/browse/SPARK-45136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45136: --- Labels: pull-request-available (was: ) > Improve ClosureCleaner to support closures defined in Ammonite REPL > --- > > Key: SPARK-45136 > URL: https://issues.apache.org/jira/browse/SPARK-45136 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0, 3.5.1 >Reporter: Vsevolod Stepanov >Priority: Major > Labels: pull-request-available > > ConnectRepl uses Ammonite REPL with CodeClassWrapper to run Scala code. It > means that each code cell is wrapped into a separate object. If multiple > variables are defined in the same cell / code block, this leads to capturing extra > variables, increasing the serialized UDF payload size or making it non-serializable. > For example, this code > {code:java} > // cell 1 > { > val x = 100 > val y = new NonSerializable > } > // cell 2 > spark.range(10).map(i => i + x).agg(sum("value")).collect(){code} > will fail because the lambda captures both `x` and `y`, as they're defined in > the same wrapper object -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
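A workaround sketch implied by the description above (not the ClosureCleaner fix itself): keep the value the UDF needs in its own cell, so the wrapper object serialized with the closure holds nothing non-serializable.

{code:scala}
// cell 1: only the serializable value the UDF actually needs
val x = 100

// cell 2: the problematic object lives in a different wrapper, never captured
val y = new NonSerializable

// cell 3: the lambda now captures only cell 1's wrapper
spark.range(10).map(i => i + x).agg(sum("value")).collect()
{code}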
[jira] [Updated] (SPARK-45399) XML: Add XML Options using newOption
[ https://issues.apache.org/jira/browse/SPARK-45399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45399: --- Labels: pull-request-available (was: ) > XML: Add XML Options using newOption > > > Key: SPARK-45399 > URL: https://issues.apache.org/jira/browse/SPARK-45399 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45399) XML: Add XML Options using newOption
Sandip Agarwala created SPARK-45399: --- Summary: XML: Add XML Options using newOption Key: SPARK-45399 URL: https://issues.apache.org/jira/browse/SPARK-45399 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Sandip Agarwala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45398) Include `ESCAPE` to `sql()` of `Like`
Max Gekk created SPARK-45398: Summary: Include `ESCAPE` to `sql()` of `Like` Key: SPARK-45398 URL: https://issues.apache.org/jira/browse/SPARK-45398 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Max Gekk Assignee: Max Gekk Fix the `sql()` method of the `Like` expression to append the `ESCAPE` clause. That makes it consistent with `toString` and fixes the issue: {code:sql} spark-sql (default)> create temp view tbl as (SELECT 'a|_' like 'a||_' escape '|', 'a|_' like 'a||_' escape 'a'); [COLUMN_ALREADY_EXISTS] The column `a|_ like a||_` already exists. Consider to choose another name or rename the existing column. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
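Until the fix lands, a plausible user-side workaround sketch is to alias both expressions explicitly, so the colliding auto-generated column names are never used:

{code:scala}
// Alias each LIKE ... ESCAPE expression so the sql()-generated names are not needed.
spark.sql("""
  CREATE TEMP VIEW tbl AS
  SELECT 'a|_' LIKE 'a||_' ESCAPE '|' AS like_escape_pipe,
         'a|_' LIKE 'a||_' ESCAPE 'a' AS like_escape_a
""")
{code}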
[jira] [Updated] (SPARK-37467) Consolidate whole stage and non-whole stage subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-37467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-37467: --- Labels: pull-request-available (was: ) > Consolidate whole stage and non-whole stage subexpression elimination > - > > Key: SPARK-37467 > URL: https://issues.apache.org/jira/browse/SPARK-37467 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Adam Binford >Priority: Major > Labels: pull-request-available > > Currently there are separate subexpression elimination paths for whole stage > and non-whole stage codegen. Consolidating these into a single code path > would make it simpler to add further enhancements, such as supporting lambda > functions (https://issues.apache.org/jira/browse/SPARK-37466). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
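To illustrate the kind of duplicate work both code paths must eliminate, here is a hedged sketch; `heavy` is a hypothetical expensive deterministic UDF and `df` an assumed DataFrame with a numeric column `a`. The repeated `heavy(col("a"))` should ideally be evaluated once per row, whether or not the projection runs under whole-stage codegen.

{code:scala}
import org.apache.spark.sql.functions.{col, udf}

// `heavy` stands in for an expensive deterministic expression; it appears in
// two output columns, so subexpression elimination should compute it once per row.
val heavy = udf((x: Long) => { Thread.sleep(1); x * 2 })

val out = df.select(
  (heavy(col("a")) + 1).alias("x"),
  (heavy(col("a")) * 3).alias("y"))
{code}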
[jira] [Commented] (SPARK-44848) MLlib GBTClassifier has wrong impurity method 'variance' instead of 'gini' or 'entropy'.
[ https://issues.apache.org/jira/browse/SPARK-44848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771433#comment-17771433 ] Oumar Nour commented on SPARK-44848: Hello, I have the same issue. I would like to know if this issue has been solved. Thanks > MLlib GBTClassifier has wrong impurity method 'variance' instead of 'gini' or > 'entropy'. > - > > Key: SPARK-44848 > URL: https://issues.apache.org/jira/browse/SPARK-44848 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.4.1 >Reporter: Elisabeth Niederbacher >Priority: Major > > The impurity method 'variance' should only be used for regressors, *not* > classifiers. For classifiers, 'gini' and 'entropy' should be available, as is > already the case for the RandomForestClassifier > [https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html] > . > Because of this bug, the 'minInfoGain' hyperparameter cannot be tuned to combat > overfitting. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
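A short sketch of the reported asymmetry using the public ML API; the exception comment reflects the issue report rather than a verified error message:

{code:scala}
import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}

// RandomForestClassifier already exposes the classification impurities:
val rf = new RandomForestClassifier().setImpurity("gini") // "entropy" also accepted

// GBTClassifier only accepts the regression impurity, so per the report this
// is expected to throw an IllegalArgumentException:
val gbt = new GBTClassifier().setImpurity("gini")
{code}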
[jira] [Updated] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct
[ https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-34591: --- Labels: pull-request-available pyspark (was: pyspark) > Pyspark undertakes pruning of decision trees and random forests outside the > control of the user, leading to undesirable and unexpected outcomes that are > challenging to diagnose and impossible to correct > -- > > Key: SPARK-34591 > URL: https://issues.apache.org/jira/browse/SPARK-34591 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0, 2.4.4, 3.1.1 >Reporter: Julian King >Priority: Major > Labels: pull-request-available, pyspark > Attachments: Reproducible example of Spark bug - no 2.pdf, > Reproducible example of Spark bug.pdf > > > *History of the issue* > SPARK-3159 implemented a method designed to reduce the computational burden > for predictions from decision trees and random forests by pruning the tree > after fitting. This is done in such a way that branches whose child leaves > all produce the same classification prediction are merged. > This was implemented via a PR: [https://github.com/apache/spark/pull/20632] > This feature is controllable by a "prune" parameter in the Scala version of > the code, which is set to True as the default behaviour. However, this > parameter is not exposed in the Pyspark API, resulting in the pruning: > * Always occurring (even when the user may not want it to occur) > * Not being documented in the ML documentation, leading to decision tree > behaviour that may conflict with what the user expects > *Why is this a problem?* > +Problem 1: Inaccurate probabilities+ > Because the decision to prune is based on the classification prediction from > the tree (not the probability prediction from the node), this introduces > additional bias compared to the situation where the pruning is not done. The > impact here may be severe in some cases. > +Problem 2: Leads to completely unacceptable behaviours in some circumstances > and for some hyper-parameters+ > My colleagues and I encountered this bug in a scenario where we could not get > a decision tree classifier (or random forest classifier with a single tree) > to split a single node, despite this being eminently supported by the data. > This renders the decision trees and random forests completely unusable. > +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and > how they interact with the data+ > Small changes in the hyper-parameters should ideally produce small changes in > the built trees. However, here we have found that small changes in the > hyper-parameters lead to large and unpredictable changes in the resultant > trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same > model, with the same hyper-parameter settings, on slightly different data may > lead to large variations in the tree structure simply as a result of the > pruning. > +Problem 4: The problems above are much worse for unbalanced data sets+ > Probability estimation on unbalanced data sets using trees should be > supported, but the pruning method described will make this very difficult. > +Problem 5: This pruning method is a substantial variation from the > description of the decision tree algorithm in the MLLib documents and is not > described+ > This made it extremely confusing for us when working out why we were seeing > certain behaviours - we had to trace back through all of the detailed Spark > release notes to identify where the problem might be. > *Proposed solutions* > +Option 1 (much easier):+ > The proposed solution here is: > * Set the default pruning behaviour to False rather than True, thereby > bringing the default behaviour back into alignment with the documentation > whilst avoiding the issues described above > +Option 2 (more involved):+ > The proposed solution here is: > * Leave the default pruning behaviour set to False > * Expand the pyspark API to expose the pruning behaviour as a > user-controllable option > * Document the change to the API > * Document the change to the tree building behaviour at appropriate points > in the Spark ML and Spark MLLib documentation > We recommend that the default behaviour be set to False because automatic > pruning is not the generally understood approach to building decision trees, > where pruning is treated as a separate and user-controllable step. > -- This message was sent by Atlassian Jira (v8.20.10#820010) -
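A hedged sketch of how the reported symptom can be diagnosed; `train` is an assumed labeled DataFrame with `features` and `label` columns:

{code:scala}
import org.apache.spark.ml.classification.DecisionTreeClassifier

// After fitting on clearly separable data, a pruned tree can come back as a
// single leaf even though a useful split exists (the Problem 2 symptom above).
val model = new DecisionTreeClassifier().setMaxDepth(2).fit(train)

println(model.numNodes)      // 1 means the root was never split (or was pruned away)
println(model.toDebugString) // full tree structure, useful when diagnosing this
{code}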
[jira] [Updated] (SPARK-45397) Add vector assembler feature transformer
[ https://issues.apache.org/jira/browse/SPARK-45397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45397: --- Labels: pull-request-available (was: ) > Add vector assembler feature transformer > > > Key: SPARK-45397 > URL: https://issues.apache.org/jira/browse/SPARK-45397 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML, PySpark >Affects Versions: 3.5.1 >Reporter: Weichen Xu >Priority: Major > Labels: pull-request-available > > Add vector assembler feature transformer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45355: -- Assignee: (was: Apache Spark) > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45355: -- Assignee: Apache Spark > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45355: -- Assignee: (was: Apache Spark) > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45355) Fix function groups in Scala Doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-45355: --- Summary: Fix function groups in Scala Doc (was: Re group functions in scala doc) > Fix function groups in Scala Doc > > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45355) Re group functions in scala doc
[ https://issues.apache.org/jira/browse/SPARK-45355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-45355: --- Summary: Re group functions in scala doc (was: Fix function groups in Scala Doc) > Re group functions in scala doc > --- > > Key: SPARK-45355 > URL: https://issues.apache.org/jira/browse/SPARK-45355 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45397) Add vector assembler feature transformer
Weichen Xu created SPARK-45397: -- Summary: Add vector assembler feature transformer Key: SPARK-45397 URL: https://issues.apache.org/jira/browse/SPARK-45397 Project: Spark Issue Type: Sub-task Components: Connect, ML, PySpark Affects Versions: 3.5.1 Reporter: Weichen Xu Add vector assembler feature transformer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
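For context, the existing VectorAssembler API this ticket ports to Spark Connect, as a sketch; `df` is an assumed DataFrame with numeric columns "age" and "income":

{code:scala}
import org.apache.spark.ml.feature.VectorAssembler

// Combine the input columns into a single "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")

val assembled = assembler.transform(df)
{code}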