[jira] [Assigned] (SPARK-38076) Remove redundant null-check is covered by further condition

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38076:
-

Assignee: Yang Jie

> Remove redundant null-check is covered by further condition
> ---
>
> Key: SPARK-38076
> URL: https://issues.apache.org/jira/browse/SPARK-38076
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> There are many code patterns like the following:
> {code:java}
> obj != null && obj instanceof SomeClass {code}
> The null-check is redundant because the instanceof operator already implies non-nullity.
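
For illustration, a minimal Scala sketch of the same simplification (the value below is made up; the pattern above is Java, but Scala's isInstanceOf behaves identically for null):

{code:scala}
// Toy value; in real code it would come from elsewhere and may be null.
val value: Any = if (scala.util.Random.nextBoolean()) "spark" else null

// Before: explicit null check plus type check.
if (value != null && value.isInstanceOf[String]) println("value is a String")

// After: isInstanceOf already returns false for null, so the null check can be dropped.
if (value.isInstanceOf[String]) println("value is a String")
{code}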



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38076) Remove redundant null-check is covered by further condition

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38076.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35369
[https://github.com/apache/spark/pull/35369]

> Remove redundant null-check is covered by further condition
> ---
>
> Key: SPARK-38076
> URL: https://issues.apache.org/jira/browse/SPARK-38076
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> There are many code patterns like the following:
> {code:java}
> obj != null && obj instanceof SomeClass {code}
> The null-check is redundant because the instanceof operator already implies non-nullity.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38080) Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38080:
-

Assignee: Shixiong Zhu

> Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and 
> resetTerminated'
> --
>
> Key: SPARK-38080
> URL: https://issues.apache.org/jira/browse/SPARK-38080
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 3.2.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> {code:java}
> [info] - awaitAnyTermination with timeout and resetTerminated *** FAILED *** 
> (14 seconds, 304 milliseconds)
> [info]   Did not throw SparkException when expected. Expected exception 
> org.apache.spark.sql.streaming.StreamingQueryException to be thrown, but 
> org.scalatest.exceptions.TestFailedException was thrown (StreamTest.scala:935)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
> [info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
> [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
> [info]   at 
> org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.$anonfun$test$4(StreamTest.scala:935)
> [info]   at org.scalatest.Assertions.withClue(Assertions.scala:1065)
> [info]   at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563)
> [info]   at 
> org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.test(StreamTest.scala:935)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.testAwaitAnyTermination(StreamingQueryManagerSuite.scala:445)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10(StreamingQueryManagerSuite.scala:221)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10$adapted(StreamingQueryManagerSuite.scala:140)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$withQueriesOn$1(StreamingQueryManagerSuite.scala:421)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfterImpl(StreamingQueryManagerSuite.scala:39)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.withQueriesOn(StreamingQueryManagerSuite.scala:397)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$8(StreamingQueryManagerSuite.scala:140)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at 
> org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$testQuietly$1(SQLTestUtils.scala:115)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
> 

[jira] [Resolved] (SPARK-38080) Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38080.
---
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35372
[https://github.com/apache/spark/pull/35372]

> Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and 
> resetTerminated'
> --
>
> Key: SPARK-38080
> URL: https://issues.apache.org/jira/browse/SPARK-38080
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 3.2.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> {code:java}
> [info] - awaitAnyTermination with timeout and resetTerminated *** FAILED *** 
> (14 seconds, 304 milliseconds)
> [info]   Did not throw SparkException when expected. Expected exception 
> org.apache.spark.sql.streaming.StreamingQueryException to be thrown, but 
> org.scalatest.exceptions.TestFailedException was thrown (StreamTest.scala:935)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
> [info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
> [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
> [info]   at 
> org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.$anonfun$test$4(StreamTest.scala:935)
> [info]   at org.scalatest.Assertions.withClue(Assertions.scala:1065)
> [info]   at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563)
> [info]   at 
> org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.test(StreamTest.scala:935)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.testAwaitAnyTermination(StreamingQueryManagerSuite.scala:445)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10(StreamingQueryManagerSuite.scala:221)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10$adapted(StreamingQueryManagerSuite.scala:140)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$withQueriesOn$1(StreamingQueryManagerSuite.scala:421)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfterImpl(StreamingQueryManagerSuite.scala:39)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
> [info]   at 
> org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.withQueriesOn(StreamingQueryManagerSuite.scala:397)
> [info]   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$8(StreamingQueryManagerSuite.scala:140)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at 
> org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$testQuietly$1(SQLTestUtils.scala:115)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)

[jira] [Resolved] (SPARK-37397) Inline type hints for python/pyspark/ml/base.py

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37397.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35289
[https://github.com/apache/spark/pull/35289]

> Inline type hints for python/pyspark/ml/base.py
> ---
>
> Key: SPARK-37397
> URL: https://issues.apache.org/jira/browse/SPARK-37397
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/base.pyi to 
> python/pyspark/ml/base.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37397) Inline type hints for python/pyspark/ml/base.py

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37397:
-

Assignee: Maciej Szymkiewicz

> Inline type hints for python/pyspark/ml/base.py
> ---
>
> Key: SPARK-37397
> URL: https://issues.apache.org/jira/browse/SPARK-37397
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/base.pyi to 
> python/pyspark/ml/base.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38013:
--
Component/s: Tests

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj is converted to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as a bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-38013:
---

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj is converted to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as a bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38013.
---
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

This JIRA added new test coverage.

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj is converted to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as a bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38013) AQE can change bhj to smj if no extra shuffle introduce

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38013:
-

Assignee: XiDuo You

> AQE can change bhj to smj if no extra shuffle introduce
> ---
>
> Key: SPARK-38013
> URL: https://issues.apache.org/jira/browse/SPARK-38013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>
> An example to reproduce the bug.
> {code:java}
> create table t1 as select 1 c1, 2 c2;
> create table t2 as select 1 c1, 2 c2;
> create table t3 as select 1 c1, 2 c2;
> set spark.sql.adaptive.autoBroadcastJoinThreshold=-1;
> select /*+ merge(t3) */ * from t1
> left join (
> select c1 as c from t3
> ) t3 on t1.c1 = t3.c
> left join (
> select /*+ repartition(c1) */ c1 from t2
> ) t2 on t1.c1 = t2.c1;
> {code}
> The key to reproducing this bug is that a bhj is converted to smj/shj without 
> introducing an extra shuffle, and AQE does not think the join can be planned 
> as a bhj.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38087) select doesnt validate if the column already exists

2022-02-01 Thread Deepa Vasanthkumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepa Vasanthkumar updated SPARK-38087:
---
Description: 
 

Select doesn't validate whether an alias column is already present in the 
dataframe. 

After that, we cannot do anything with that column in the dataframe:
df4 = df2.select(df2.firstname, df2.lastname) --> throws an AnalysisException
df4.show()

However, drop will not let you drop the said column either. 

Scenario to reproduce:
df2 = df1.select("*", (df1.firstname).alias("firstname"))   --> this adds a 
duplicate column
df2.show()
df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
'firstname' is ambiguous, could be: firstname, firstname.

Is this expected behavior?
!select vs drop.png!
!image-2022-02-02-06-28-23-543.png!
 
 
 
 
 

 

  was:
 

Select doesnt validate whether the alias column is already present in the 
dataframe. 

However drop will not let you drop the said column.

 

Scenario to reproduce :
df2 = df1.select("*", (df1.firstname).alias("firstname"))   ---> this will add 
same column
df2.show() 
df2.drop(df2.firstname) --> this will give AnalysisException: Reference 
'firstname' is ambiguous, could be: firstname, firstname.
 

 

Is this expected behavior .
  !select vs drop.png!
!image-2022-02-02-06-28-23-543.png!
 
 
 
 
 

 


> select doesnt validate if the column already exists
> ---
>
> Key: SPARK-38087
> URL: https://issues.apache.org/jira/browse/SPARK-38087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: Spark v3.2.1, master = local[*] (reproducible in any environment)
>Reporter: Deepa Vasanthkumar
>Priority: Minor
> Fix For: 3.3
>
> Attachments: select vs drop.png
>
>
>  
> Select doesn't validate whether an alias column is already present in the 
> dataframe. 
> After that, we cannot do anything with that column in the dataframe:
> df4 = df2.select(df2.firstname, df2.lastname) --> throws an AnalysisException
> df4.show()
>
> However, drop will not let you drop the said column either. 
>
> Scenario to reproduce:
> df2 = df1.select("*", (df1.firstname).alias("firstname"))   --> this adds a 
> duplicate column
> df2.show()
> df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
> 'firstname' is ambiguous, could be: firstname, firstname.
>
> Is this expected behavior?
> !select vs drop.png!
> !image-2022-02-02-06-28-23-543.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38087) select doesnt validate if the column already exists

2022-02-01 Thread Deepa Vasanthkumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepa Vasanthkumar updated SPARK-38087:
---
Attachment: select vs drop.png

> select doesnt validate if the column already exists
> ---
>
> Key: SPARK-38087
> URL: https://issues.apache.org/jira/browse/SPARK-38087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: Spark v3.2.1, master = local[*] (reproducible in any environment)
>Reporter: Deepa Vasanthkumar
>Priority: Minor
> Fix For: 3.3
>
> Attachments: select vs drop.png
>
>
>  
> Select doesn't validate whether an alias column is already present in the 
> dataframe. 
> However, drop will not let you drop the said column.
>
> Scenario to reproduce:
> df2 = df1.select("*", (df1.firstname).alias("firstname"))   --> this adds a 
> duplicate column
> df2.show()
> df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
> 'firstname' is ambiguous, could be: firstname, firstname.
>
> Is this expected behavior?
>
> !image-2022-02-02-06-28-23-543.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38087) select doesnt validate if the column already exists

2022-02-01 Thread Deepa Vasanthkumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepa Vasanthkumar updated SPARK-38087:
---
Description: 
 

Select doesn't validate whether an alias column is already present in the 
dataframe. 

However, drop will not let you drop the said column.

Scenario to reproduce:
df2 = df1.select("*", (df1.firstname).alias("firstname"))   --> this adds a 
duplicate column
df2.show()
df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
'firstname' is ambiguous, could be: firstname, firstname.

Is this expected behavior?
!select vs drop.png!
!image-2022-02-02-06-28-23-543.png!
 
 
 
 
 

 

  was:
 

Select doesnt validate whether the alias column is already present in the 
dataframe. 

However drop will not let you drop the said column.

 

Scenario to reproduce :
df2 = df1.select("*", (df1.firstname).alias("firstname"))   ---> this will add 
same column
df2.show() 
df2.drop(df2.firstname) --> this will give AnalysisException: Reference 
'firstname' is ambiguous, could be: firstname, firstname.
 
Is this expected behavior .
 
 
!image-2022-02-02-06-28-23-543.png!
 
 
 
 
 

 


> select doesnt validate if the column already exists
> ---
>
> Key: SPARK-38087
> URL: https://issues.apache.org/jira/browse/SPARK-38087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
> Environment: Spark v3.2.1, master = local[*] (reproducible in any environment)
>Reporter: Deepa Vasanthkumar
>Priority: Minor
> Fix For: 3.3
>
> Attachments: select vs drop.png
>
>
>  
> Select doesn't validate whether an alias column is already present in the 
> dataframe. 
> However, drop will not let you drop the said column.
>
> Scenario to reproduce:
> df2 = df1.select("*", (df1.firstname).alias("firstname"))   --> this adds a 
> duplicate column
> df2.show()
> df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
> 'firstname' is ambiguous, could be: firstname, firstname.
>
> Is this expected behavior?
> !select vs drop.png!
> !image-2022-02-02-06-28-23-543.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38087) select doesnt validate if the column already exists

2022-02-01 Thread Deepa Vasanthkumar (Jira)
Deepa Vasanthkumar created SPARK-38087:
--

 Summary: select doesnt validate if the column already exists
 Key: SPARK-38087
 URL: https://issues.apache.org/jira/browse/SPARK-38087
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.1
 Environment: Spark v3.2.1, master = local[*] (reproducible in any environment)
Reporter: Deepa Vasanthkumar
 Fix For: 3.3


 

Select doesn't validate whether an alias column is already present in the 
dataframe. 

However, drop will not let you drop the said column.

Scenario to reproduce:
df2 = df1.select("*", (df1.firstname).alias("firstname"))   --> this adds a 
duplicate column
df2.show()
df2.drop(df2.firstname) --> this gives AnalysisException: Reference 
'firstname' is ambiguous, could be: firstname, firstname.

Is this expected behavior?

!image-2022-02-02-06-28-23-543.png!
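
For reference, a self-contained Scala sketch of the scenario described above (the SparkSession setup, column names, and sample data are assumptions; the report itself uses the PySpark API):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DuplicateColumnRepro extends App {
  val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
  import spark.implicits._

  val df1 = Seq(("James", "Smith"), ("Anna", "Rose")).toDF("firstname", "lastname")

  // select() does not complain about adding a second column named "firstname" ...
  val df2 = df1.select(col("*"), col("firstname").alias("firstname"))
  df2.show()

  // ... but any later reference to that name is ambiguous, e.g.
  // df2.drop(df2("firstname"))  // AnalysisException: Reference 'firstname' is ambiguous

  spark.stop()
}
{code}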
 
 
 
 
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25789) Support for Dataset of Avro

2022-02-01 Thread Prashant Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485580#comment-17485580
 ] 

Prashant Pandey edited comment on SPARK-25789 at 2/2/22, 5:51 AM:
--

Is there any update on this ticket? We are facing this problem trying to use an 
Avro generated class to create a Dataset from a Dataframe.

 

"{{{}UnsupportedOperationException: Cannot have circular references in bean 
class, but got the circular reference of class class org.apache.avro.Schema"{}}}

 

[https://stackoverflow.com/questions/70950967/circular-reference-in-bean-class-while-creating-a-dataset-from-an-avro-generated]


was (Author: pandepra):
Is there any update on this ticket? We are facing this problem trying to use an 
Avro generated class to create a Dataset from a Dataframe.

 

https://stackoverflow.com/questions/70950967/circular-reference-in-bean-class-while-creating-a-dataset-from-an-avro-generated

> Support for Dataset of Avro
> ---
>
> Key: SPARK-25789
> URL: https://issues.apache.org/jira/browse/SPARK-25789
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Support for Dataset of Avro records in an API that would allow the user to 
> provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} 
> encoder. This functionality was previously to be provided by SPARK-22739 and 
> [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro 
> functionality was folded into Spark-proper by SPARK-24768, eliminating the 
> need to maintain a separate library for Avro in Spark. Resolution of this 
> issue would:
>  * Add necessary {{Expression}} elements to Spark
>  * Add an {{AvroEncoder}} for Datasets of Avro records to Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25789) Support for Dataset of Avro

2022-02-01 Thread Prashant Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485580#comment-17485580
 ] 

Prashant Pandey edited comment on SPARK-25789 at 2/2/22, 5:50 AM:
--

Is there any update on this ticket? We are facing this problem trying to use an 
Avro generated class to create a Dataset from a Dataframe.

 

https://stackoverflow.com/questions/70950967/circular-reference-in-bean-class-while-creating-a-dataset-from-an-avro-generated


was (Author: pandepra):
Is there any update on this ticket? We are facing this problem trying to use an 
Avro generated class to create a Dataset from a Dataframe.

> Support for Dataset of Avro
> ---
>
> Key: SPARK-25789
> URL: https://issues.apache.org/jira/browse/SPARK-25789
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Support for Dataset of Avro records in an API that would allow the user to 
> provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} 
> encoder. This functionality was previously to be provided by SPARK-22739 and 
> [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro 
> functionality was folded into Spark-proper by SPARK-24768, eliminating the 
> need to maintain a separate library for Avro in Spark. Resolution of this 
> issue would:
>  * Add necessary {{Expression}} elements to Spark
>  * Add an {{AvroEncoder}} for Datasets of Avro records to Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25789) Support for Dataset of Avro

2022-02-01 Thread Prashant Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485580#comment-17485580
 ] 

Prashant Pandey commented on SPARK-25789:
-

Is there any update on this ticket? We are facing this problem trying to use an 
Avro generated class to create a Dataset from a Dataframe.

> Support for Dataset of Avro
> ---
>
> Key: SPARK-25789
> URL: https://issues.apache.org/jira/browse/SPARK-25789
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Support for Dataset of Avro records in an API that would allow the user to 
> provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} 
> encoder. This functionality was previously to be provided by SPARK-22739 and 
> [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro 
> functionality was folded into Spark-proper by SPARK-24768, eliminating the 
> need to maintain a separate library for Avro in Spark. Resolution of this 
> issue would:
>  * Add necessary {{Expression}} elements to Spark
>  * Add an {{AvroEncoder}} for Datasets of Avro records to Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38078) Aggregation with Watermark in AppendMode is holding data beyong water mark boundary.

2022-02-01 Thread krishna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

krishna updated SPARK-38078:

Description: 
I am struggling with an issue. I am not sure whether my understanding is 
wrong or this is a bug in Spark.

 # I am reading a stream from Event Hubs/Kafka (Extract).
 # I am pivoting and aggregating the above dataframe (Transformation). This is a 
watermarked aggregation.
 # I am writing the aggregation to a console/Delta sink in APPEND mode with a 
trigger.

However, the most recently published message to Event Hubs is not written to 
the console/Delta sink even after it falls outside the watermark window.

My understanding is that the event should be emitted to the Delta table after 
event time + watermark.

Moreover, all events held in state should be flushed to the sink, irrespective 
of the watermark, before stopping, so the query can shut down gracefully.

Please advise.

  was:
 I am struggling with a unique issue. I am not sure if my understanding is 
wrong or this is a bug with spark.
 
 #  I am reading a stream from events hub/kafka ( Extract)
 #  Pivoting and Aggregating the above dataframe ( Transformation). This is a 
WATERMARKED aggregation.
 #  writing the aggregation to Console/Delta table in APPEND  mode with a 
Trigger . 

However, the most recently published message to event hub is not writing to 
console/delta even after falling out of the watermark time. 
 
 My understanding is the event should be inserted to  the Delta table after 
Eventtime+Watermark.
 
Please advise.


> Aggregation with Watermark in AppendMode is holding data beyong water mark 
> boundary.
> 
>
> Key: SPARK-38078
> URL: https://issues.apache.org/jira/browse/SPARK-38078
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: krishna
>Priority: Major
>
>  I am struggling with an issue. I am not sure whether my understanding is 
> wrong or this is a bug in Spark.
>  
>  # I am reading a stream from Event Hubs/Kafka (Extract).
>  # I am pivoting and aggregating the above dataframe (Transformation). This is a 
> watermarked aggregation.
>  # I am writing the aggregation to a console/Delta sink in APPEND mode with a 
> trigger.
> However, the most recently published message to Event Hubs is not written to 
> the console/Delta sink even after it falls outside the watermark window.
>  
> My understanding is that the event should be emitted to the Delta table after 
> event time + watermark.
>  
> Moreover, all events held in state should be flushed to the sink, irrespective 
> of the watermark, before stopping, so the query can shut down gracefully.
>  
> Please advise.
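
For context, a minimal Scala sketch of the kind of pipeline described above (the rate source, window sizes, watermark, and trigger are placeholders, not the reporter's actual job); in append mode a windowed aggregate is only emitted once the watermark passes the end of its window, and only when a later micro-batch advances the watermark:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

object WatermarkAppendSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()

  // Placeholder source; the report reads from Event Hubs/Kafka instead.
  val events = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

  // Watermarked aggregation, as in step 2 of the report.
  val counts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count()

  // Append-mode sink with a trigger, as in step 3 of the report.
  val query = counts.writeStream
    .outputMode("append")
    .format("console")
    .trigger(Trigger.ProcessingTime("1 minute"))
    .start()

  query.awaitTermination()
}
{code}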



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38059) Incorrect query ordering with flatMap() and distinct()

2022-02-01 Thread Vu Tan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485550#comment-17485550
 ] 

Vu Tan commented on SPARK-38059:


I ran your Java app above and got the result below (on Spark 3.2.0):
{code:java}
0,6
0,7
1,6
1,7 {code}
So I think it is working as expected. 

> Incorrect query ordering with flatMap() and distinct()
> --
>
> Key: SPARK-38059
> URL: https://issues.apache.org/jira/browse/SPARK-38059
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.2, 3.2.0
>Reporter: AJ Bousquet
>Priority: Major
>
> I have a Dataset of non-unique identifiers that I can use with 
> {{Dataset::flatMap()}} to create multiple rows with sub-identifiers for each 
> id. When I run the code below, the {{limit(2)}} call is placed _after_ the 
> call to {{flatMap()}} in the optimized logical plan. This unexpectedly yields 
> only 2 rows, when I would expect it to yield 6.
> {code:java}
> StructType idSchema = DataTypes.createStructType(
>     List.of(DataTypes.createStructField("id", DataTypes.LongType, false)));
> StructType flatMapSchema = DataTypes.createStructType(List.of(
>     DataTypes.createStructField("id", DataTypes.LongType, false),
>     DataTypes.createStructField("subId", DataTypes.LongType, false)
> ));
> Dataset<Row> inputDataset = context.sparkSession().createDataset(
>     LongStream.range(0, 5).mapToObj((id) -> RowFactory.create(id)).collect(Collectors.toList()),
>     RowEncoder.apply(idSchema)
> );
> return inputDataset
>     .distinct()
>     .limit(2)
>     .flatMap((Row row) -> {
>         Long id = row.getLong(row.fieldIndex("id"));
>         return LongStream.range(6, 8).mapToObj((subid) -> RowFactory.create(id, subid)).iterator();
>     }, RowEncoder.apply(flatMapSchema));
> {code}
> When run, the above code produces something like:
> ||id||subID||
> |0|6|
> |0|7|
> But I would expect something like:
> ||id||subID||
> |1|6|
> |1|7|
> |1|8|
> |0|6|
> |0|7|
> |0|8|



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38086) Make ArrowColumnVector Extendable

2022-02-01 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485530#comment-17485530
 ] 

Kazuyuki Tanimura commented on SPARK-38086:
---

I am working on this

> Make ArrowColumnVector Extendable
> -
>
> Key: SPARK-38086
> URL: https://issues.apache.org/jira/browse/SPARK-38086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Some Spark extension libraries need to extend ArrowColumnVector.java. For 
> now, it is impossible as ArrowColumnVector class is final and the accessors 
> are all private.
> For example, Rapids copies the entire ArrowColumnVector class in order to 
> work around the issue
> [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java]
> Proposing to relax private/final restrictions to make ArrowColumnVector 
> extendable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38086) Make ArrowColumnVector Extendable

2022-02-01 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-38086:
-

 Summary: Make ArrowColumnVector Extendable
 Key: SPARK-38086
 URL: https://issues.apache.org/jira/browse/SPARK-38086
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kazuyuki Tanimura


Some Spark extension libraries need to extend ArrowColumnVector.java. At the moment 
this is impossible because the ArrowColumnVector class is final and its accessors 
are all private.

For example, Rapids copies the entire ArrowColumnVector class in order to work 
around the issue

[https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java]

This proposes relaxing the private/final restrictions to make ArrowColumnVector 
extendable.
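
For illustration, a sketch of the kind of subclass this change would enable. This is hypothetical: the subclass name and behavior are made up, and it only compiles once the final/private restrictions on ArrowColumnVector are actually relaxed as proposed.

{code:scala}
import org.apache.arrow.vector.ValueVector
import org.apache.spark.sql.vectorized.ArrowColumnVector

// Hypothetical extension: wrap an inherited accessor with extra bookkeeping.
class InstrumentedArrowColumnVector(vector: ValueVector)
    extends ArrowColumnVector(vector) {

  private var reads = 0L

  override def getInt(rowId: Int): Int = {
    reads += 1            // count accesses before delegating to the parent
    super.getInt(rowId)
  }

  def readCount: Long = reads
}
{code}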



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35801) SPIP: Row-level operations in Data Source V2

2022-02-01 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485508#comment-17485508
 ] 

L. C. Hsieh commented on SPARK-35801:
-

I think we can leave this open and put sub-tasks under this, like 
https://issues.apache.org/jira/browse/SPARK-34849.

> SPIP: Row-level operations in Data Source V2
> 
>
> Key: SPARK-35801
> URL: https://issues.apache.org/jira/browse/SPARK-35801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Major
>  Labels: SPIP
>
> Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more 
> important for modern Big Data workflows. Use cases include but are not 
> limited to deleting a set of records for regulatory compliance, updating a 
> set of records to fix an issue in the ingestion pipeline, applying changes in 
> a transaction log to a fact table. Row-level operations allow users to easily 
> express their use cases that would otherwise require much more SQL. Common 
> patterns for updating partitions are to read, union, and overwrite or read, 
> diff, and append. Using commands like MERGE, these operations are easier to 
> express and can be more efficient to run.
> Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] 
> and Spark should implement similar support.
> SPIP: 
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38047) Add OUTLIER_NO_FALLBACK executor roll policy

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38047:
--
Summary: Add OUTLIER_NO_FALLBACK executor roll policy  (was: Provide an 
option to only roll executors if they are outliers)

> Add OUTLIER_NO_FALLBACK executor roll policy
> 
>
> Key: SPARK-38047
> URL: https://issues.apache.org/jira/browse/SPARK-38047
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Alex Holmes
>Assignee: Alex Holmes
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently executor rolling will always kill one executor every 
> {{spark.kubernetes.executor.rollInterval}}. For some of the policies this 
> may not be optimal in cases where the executor metric isn't an outlier 
> compared to other executors. There is a cost associated with killing 
> executors (ramp-up time for new executors, for example) which applications may 
> not want to incur for non-outlier executors.
>  
> This ticket would add the ability to only kill executors if they are 
> outliers.
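
For context, a minimal sketch of how such a policy might be selected. Only {{spark.kubernetes.executor.rollInterval}} is named above; the rollPolicy key, the OUTLIER_NO_FALLBACK value, and the interval shown here are assumptions based on this sub-task's summary.

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: values are placeholders, and the policy key/value are assumed.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.rollInterval", "600s")
  .set("spark.kubernetes.executor.rollPolicy", "OUTLIER_NO_FALLBACK")
{code}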



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38085) DataSource V2: Handle DELETE commands for group-based sources

2022-02-01 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-38085:


 Summary: DataSource V2: Handle DELETE commands for group-based 
sources
 Key: SPARK-38085
 URL: https://issues.apache.org/jira/browse/SPARK-38085
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Anton Okolnychyi


As per SPARK-35801, we should handle DELETE statements for sources that can 
replace groups of data (e.g. partitions, files).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35801) SPIP: Row-level operations in Data Source V2

2022-02-01 Thread Anton Okolnychyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-35801:
-
Affects Version/s: 3.3.0
   (was: 3.2.0)

> SPIP: Row-level operations in Data Source V2
> 
>
> Key: SPARK-35801
> URL: https://issues.apache.org/jira/browse/SPARK-35801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Priority: Major
>  Labels: SPIP
>
> Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more 
> important for modern Big Data workflows. Use cases include but are not 
> limited to deleting a set of records for regulatory compliance, updating a 
> set of records to fix an issue in the ingestion pipeline, applying changes in 
> a transaction log to a fact table. Row-level operations allow users to easily 
> express their use cases that would otherwise require much more SQL. Common 
> patterns for updating partitions are to read, union, and overwrite or read, 
> diff, and append. Using commands like MERGE, these operations are easier to 
> express and can be more efficient to run.
> Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] 
> and Spark should implement similar support.
> SPIP: 
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35801) SPIP: Row-level operations in Data Source V2

2022-02-01 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485502#comment-17485502
 ] 

Anton Okolnychyi commented on SPARK-35801:
--

[~viirya], shall we keep this one open until the implementation is done or can 
we close it now? The community has already voted on this SPIP.

> SPIP: Row-level operations in Data Source V2
> 
>
> Key: SPARK-35801
> URL: https://issues.apache.org/jira/browse/SPARK-35801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>  Labels: SPIP
>
> Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more 
> important for modern Big Data workflows. Use cases include but are not 
> limited to deleting a set of records for regulatory compliance, updating a 
> set of records to fix an issue in the ingestion pipeline, applying changes in 
> a transaction log to a fact table. Row-level operations allow users to easily 
> express their use cases that would otherwise require much more SQL. Common 
> patterns for updating partitions are to read, union, and overwrite or read, 
> diff, and append. Using commands like MERGE, these operations are easier to 
> express and can be more efficient to run.
> Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] 
> and Spark should implement similar support.
> SPIP: 
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35801) SPIP: Row-level operations in Data Source V2

2022-02-01 Thread Anton Okolnychyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-35801:
-
Labels: SPIP  (was: )

> SPIP: Row-level operations in Data Source V2
> 
>
> Key: SPARK-35801
> URL: https://issues.apache.org/jira/browse/SPARK-35801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>  Labels: SPIP
>
> Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more 
> important for modern Big Data workflows. Use cases include but are not 
> limited to deleting a set of records for regulatory compliance, updating a 
> set of records to fix an issue in the ingestion pipeline, applying changes in 
> a transaction log to a fact table. Row-level operations allow users to easily 
> express their use cases that would otherwise require much more SQL. Common 
> patterns for updating partitions are to read, union, and overwrite or read, 
> diff, and append. Using commands like MERGE, these operations are easier to 
> express and can be more efficient to run.
> Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] 
> and Spark should implement similar support.
> SPIP: 
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38084.
---
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35381
[https://github.com/apache/spark/pull/35381]

> Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
> 
>
> Key: SPARK-38084
> URL: https://issues.apache.org/jira/browse/SPARK-38084
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38084:
-

Assignee: Dongjoon Hyun

> Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
> 
>
> Key: SPARK-38084
> URL: https://issues.apache.org/jira/browse/SPARK-38084
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38075) Hive script transform with order by and limit will return fake rows

2022-02-01 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-38075:
-
Fix Version/s: 3.1.3
   (was: 3.1.4)

> Hive script transform with order by and limit will return fake rows
> ---
>
> Key: SPARK-38075
> URL: https://issues.apache.org/jira/browse/SPARK-38075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> For example:
> {noformat}
> create or replace temp view t as
> select * from values
> (1),
> (2),
> (3)
> as t(a);
> select transform(a)
> USING 'cat' AS (a int)
> FROM t order by a limit 10;
> {noformat}
> This returns:
> {noformat}
> NULL
> NULL
> NULL
> 1
> 2
> 3
> {noformat}
> Without {{order by}} and {{limit}}, the query returns:
> {noformat}
> 1
> 2
> 3
> {noformat}
> Spark script transform does not have this issue. That is, if 
> {{spark.sql.catalogImplementation=in-memory}}, Spark does not return fake 
> rows.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up

2022-02-01 Thread krishna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483838#comment-17483838
 ] 

krishna edited comment on SPARK-24156 at 2/1/22, 9:04 PM:
--

Hi [~kcsrms] [~tdas],

I am having the same issue. Is this issue resolved? Is there a specific 
version I need to choose?

I am struggling with an issue. I am not sure whether my understanding is 
wrong or this is a bug in Spark.

 # I am reading a stream from Event Hubs (Extract).
 # I am pivoting and aggregating the above dataframe (Transformation). This is a 
watermarked aggregation.
 # I am writing the aggregation to a Delta table in APPEND mode with a trigger.

However, the most recently published message to Event Hubs is not written to 
Delta even after it falls outside the watermark window.

My understanding is that the data should be emitted to the Delta table after 
event time + watermark.

Moreover, all events held in state should be flushed to the sink before 
stopping, so the query can shut down gracefully.
 
 


was (Author: JIRAUSER284389):
Hi [~kcsrms] [~tdas] ,

  I am having the same issue. Is this issue resovled? is there a specific 
version I need to choose?

 
  I am struggling with a unique issue. I am not sure if my understanding is 
wrong or this is a bug with spark.
 
 #  I am reading a stream from events hub ( Extract)
 #  Pivoting and Aggregating the above dataframe ( Transformation). This is a 
WATERMARKED aggregation.
 #  writing the aggregation to Delta table in APPEND  mode with a Trigger . 

However, the most recently published message to event hub is not writing to 
delta even after falling out of the watermark time. 
 
 My understanding is the data should be inserted to the Delta table after 
Eventtime+Watermark.
 
 

> Enable no-data micro batches for more eager streaming state clean up 
> -
>
> Key: SPARK-24156
> URL: https://issues.apache.org/jira/browse/SPARK-24156
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when 
> there is new data to process. This is sensible in most cases as we dont want 
> to unnecessarily use resources when there is nothing new to process. However, 
> in some cases of stateful streaming queries, this delays state clean up as 
> well as clean-up based output. For example, consider a streaming aggregation 
> query with watermark-based state cleanup. The watermark is updated after 
> every batch with new data completes. The updated value is used in the next 
> batch to clean up state, and output finalized aggregates in append mode. 
> However, if there is no data, then the next batch does not occur, and 
> cleanup/output gets delayed unnecessarily. This is true for all stateful 
> streaming operators - aggregation, deduplication, joins, mapGroupsWithState
> This issue tracks the work to enable no-data batches in MicroBatchExecution. 
> The major challenge is that all the tests of relevant stateful operations add 
> dummy data to force another batch for testing the state cleanup. So a lot of 
> the tests are going to be changed. So my plan is to enable no-data batches 
> for different stateful operators one at a time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38075) Hive script transform with order by and limit will return fake rows

2022-02-01 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-38075:
-
Fix Version/s: 3.1.4
   (was: 3.1.3)
Affects Version/s: 3.1.3

> Hive script transform with order by and limit will return fake rows
> ---
>
> Key: SPARK-38075
> URL: https://issues.apache.org/jira/browse/SPARK-38075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>
> For example:
> {noformat}
> create or replace temp view t as
> select * from values
> (1),
> (2),
> (3)
> as t(a);
> select transform(a)
> USING 'cat' AS (a int)
> FROM t order by a limit 10;
> {noformat}
> This returns:
> {noformat}
> NULL
> NULL
> NULL
> 1
> 2
> 3
> {noformat}
> Without {{order by}} and {{limit}}, the query returns:
> {noformat}
> 1
> 2
> 3
> {noformat}
> Spark script transform does not have this issue. That is, if 
> {{spark.sql.catalogImplementation=in-memory}}, Spark does not return fake 
> rows.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`

2022-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485430#comment-17485430
 ] 

Apache Spark commented on SPARK-38084:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35381

> Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
> 
>
> Key: SPARK-38084
> URL: https://issues.apache.org/jira/browse/SPARK-38084
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`

2022-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38084:


Assignee: Apache Spark

> Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
> 
>
> Key: SPARK-38084
> URL: https://issues.apache.org/jira/browse/SPARK-38084
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`

2022-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38084:


Assignee: (was: Apache Spark)

> Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
> 
>
> Key: SPARK-38084
> URL: https://issues.apache.org/jira/browse/SPARK-38084
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`

2022-02-01 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-38084:
-

 Summary: Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
 Key: SPARK-38084
 URL: https://issues.apache.org/jira/browse/SPARK-38084
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0, 3.2.2
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38047) Provide an option to only roll executors if they are outliers

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38047.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35373
[https://github.com/apache/spark/pull/35373]

> Provide an option to only roll executors if they are outliers
> -
>
> Key: SPARK-38047
> URL: https://issues.apache.org/jira/browse/SPARK-38047
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Alex Holmes
>Assignee: Alex Holmes
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently executor rolling will always kill one executor every 
> {{{}spark.kubernetes.executor.rollInterval{}}}. For some of the policies this 
> may not be optimal in cases where the executor metric isn't an outlier 
> compared to other executors. There is a cost associated with killing 
> executors (ramp-up time for new executors for example) which applications may 
> not want to incur for non-outlier executors.
>  
> This ticket would add the ability to only kill executors if they are 
> outliers.
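
A possible configuration sketch for this (the rollInterval property is the one 
mentioned above; the policy and threshold property names below are assumptions 
for illustration and may not match what the final change introduces):

{code:python}
from pyspark.sql import SparkSession

# Executor rolling on Kubernetes: decommission one executor per interval.
spark = (SparkSession.builder
         .config("spark.kubernetes.executor.rollInterval", "30min")
         # Assumed policy name selecting outlier-based rolling.
         .config("spark.kubernetes.executor.rollPolicy", "OUTLIER")
         # Illustrative threshold so that non-outlier executors are left alone.
         .config("spark.kubernetes.executor.minTasksPerExecutorBeforeRolling", "100")
         .getOrCreate())
{code}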



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27826) saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" format issue

2022-02-01 Thread Bandhu Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485393#comment-17485393
 ] 

Bandhu Gupta commented on SPARK-27826:
--

Hi Fengtlyer, we are facing exactly the same issue that you reported here. I 
wanted to understand what we are missing to resolve this issue. Please let me 
know what you took from Hyukjin's comment on saveAsTable.

> saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" 
> format issue
> 
>
> Key: SPARK-27826
> URL: https://issues.apache.org/jira/browse/SPARK-27826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.4.0
> Environment: CDH 5.13.1 - Spark version 2.2.0.cloudera2
> CDH 6.1.1 - Spark version 2.4.0-cdh6.1.1
>Reporter: fengtlyer
>Priority: Minor
>
> Hi Spark Dev Team,
> We tested a few times and found that this bug can be reproduced on multiple 
> Spark versions.
> We tested on CDH 5.13.1 - Spark version 2.2.0.cloudera2 and CDH 6.1.1 - Spark 
> version 2.4.0-cdh6.1.1.
> Both of them have this bug:
> 1. If a table was created by Impala or Hive in HUE, then in Spark code 
> "write.format("parquet").mode("append").saveAsTable()" will cause the format 
> issue (see the error log below).
> 2. If the table was created by Hive/Impala in HUE, 
> "write.format("parquet").mode("overwrite").saveAsTable()" also 
> does not work.
>  2.1 If the table was created by Hive/Impala in HUE and 
> "write.format("parquet").mode("overwrite").saveAsTable()" was run first, then 
> "write.format("parquet").mode("append").saveAsTable()" works.
> 3. If the table was created by Hive/Impala in HUE, "insertInto()" still works.
>  3.1 If the table was created by Hive/Impala in HUE and "insertInto()" was used 
> to insert some new records, then trying 
> "write.format("parquet").mode("append").saveAsTable()" hits the same 
> format error.
> 4. If a Parquet table was created and some data inserted via the Hive shell, 
> then "write.format("parquet").mode("append").saveAsTable()" can insert data, 
> but Spark only shows the data inserted by Spark, and Hive only shows the data 
> inserted by Hive.
> === 
> Error Log 
> ===
> {code}
> spark.read.format("csv").option("sep",",").option("header","true").load("hdfs:///temp1/test_paquettest.csv").write.format("parquet").mode("append").saveAsTable("parquet_test_table")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: The format of the existing table 
> default.parquet_test_table is `HiveFileFormat`. It doesn't match the 
> specified format `ParquetFileFormat`.;
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
> at 
> 
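
A commonly suggested workaround for this mismatch (a sketch, not from the 
original report; whether it applies depends on how the table and its serde were 
created) is to append through insertInto, or to declare the Hive format 
explicitly instead of "parquet":

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = (spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .load("hdfs:///temp1/test_paquettest.csv"))

# Option 1: insertInto writes by column position into the existing table and
# does not run the DataSource-vs-Hive format check that saveAsTable performs.
df.write.insertInto("parquet_test_table")

# Option 2: declare the Hive serde format instead of the "parquet" source
# format when appending with saveAsTable.
df.write.format("hive").mode("append").saveAsTable("parquet_test_table")
{code}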

[jira] [Updated] (SPARK-38083) set the amount of explained variance as parameter of pyspark.ml.feature.PCA

2022-02-01 Thread Nicola (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola updated SPARK-38083:
---
Summary: set the amount of explained variance as parameter of 
pyspark.ml.feature.PCA  (was: set the amout of explained variance as parameter 
of pyspark.ml.feature.PCA)

> set the amount of explained variance as parameter of pyspark.ml.feature.PCA
> ---
>
> Key: SPARK-38083
> URL: https://issues.apache.org/jira/browse/SPARK-38083
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 3.2.2
>Reporter: Nicola
>Priority: Major
>
> As in 
> [sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html],
>  where:
> if {{0 < n_components < 1}} select the number of components such that the 
> amount of variance that needs to be explained is greater than the percentage 
> specified by n_components
> it would be useful to have a similar behavior with the k parameter in 
> pyspark.ml.feature.PCA.
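
Until such a parameter exists, one possible workaround is to fit with the full 
dimensionality and derive k from the model's explainedVariance vector (a sketch 
with illustrative data and column names):

{code:python}
import numpy as np
from pyspark.ml.feature import PCA, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data with three numeric columns assembled into a vector.
df = spark.createDataFrame(
    [(float(i), float(i) * 2.0, float(i % 3)) for i in range(10)],
    ["x1", "x2", "x3"])
df = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features").transform(df)

target = 0.95  # desired fraction of explained variance

# Fit with the maximum k, then pick the smallest k whose cumulative explained
# variance reaches the target, and refit with that k.
full = PCA(k=3, inputCol="features", outputCol="pca").fit(df)
cum = np.cumsum(full.explainedVariance.toArray())
k = int(np.searchsorted(cum, target) + 1)
reduced = PCA(k=k, inputCol="features", outputCol="pca").fit(df).transform(df)
{code}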



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38083) set the amout of explained variance as parameter of pyspark.ml.feature.PCA

2022-02-01 Thread Nicola (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola updated SPARK-38083:
---
Affects Version/s: 3.2.2
   (was: 3.2.1)

> set the amout of explained variance as parameter of pyspark.ml.feature.PCA
> --
>
> Key: SPARK-38083
> URL: https://issues.apache.org/jira/browse/SPARK-38083
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 3.2.2
>Reporter: Nicola
>Priority: Major
>
> As in 
> [sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html],
>  where:
> if {{0 < n_components < 1}} select the number of components such that the 
> amount of variance that needs to be explained is greater than the percentage 
> specified by n_components
> it would be useful to have a similar behavior with the k parameter in 
> pyspark.ml.feature.PCA.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38083) set the amout of explained variance as parameter of pyspark.ml.feature.PCA

2022-02-01 Thread Nicola (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola updated SPARK-38083:
---
Component/s: ML

> set the amout of explained variance as parameter of pyspark.ml.feature.PCA
> --
>
> Key: SPARK-38083
> URL: https://issues.apache.org/jira/browse/SPARK-38083
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 3.2.1
>Reporter: Nicola
>Priority: Major
>
> As in 
> [sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html],
>  where:
> if {{0 < n_components < 1}} select the number of components such that the 
> amount of variance that needs to be explained is greater than the percentage 
> specified by n_components
> it would be useful to have a similar behavior with the k parameter in 
> pyspark.ml.feature.PCA.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38083) set the amout of explained variance as parameter of pyspark.ml.feature.PCA

2022-02-01 Thread Nicola (Jira)
Nicola created SPARK-38083:
--

 Summary: set the amout of explained variance as parameter of 
pyspark.ml.feature.PCA
 Key: SPARK-38083
 URL: https://issues.apache.org/jira/browse/SPARK-38083
 Project: Spark
  Issue Type: Wish
  Components: MLlib
Affects Versions: 3.2.1
Reporter: Nicola


As in 
[sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html],
 where:

if {{0 < n_components < 1}} select the number of components such that the 
amount of variance that needs to be explained is greater than the percentage 
specified by n_components

it would be useful to have a similar behavior with the k parameter in 
pyspark.ml.feature.PCA.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38081) Support cloud-backend in K8s IT with SBT

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38081.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35376
[https://github.com/apache/spark/pull/35376]

> Support cloud-backend in K8s IT with SBT
> 
>
> Key: SPARK-38081
> URL: https://issues.apache.org/jira/browse/SPARK-38081
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38081) Support cloud-backend in K8s IT with SBT

2022-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38081:
-

Assignee: Dongjoon Hyun

> Support cloud-backend in K8s IT with SBT
> 
>
> Key: SPARK-38081
> URL: https://issues.apache.org/jira/browse/SPARK-38081
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38082:
---
Description: 
Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}.

However, 1.7 was released almost 9 years ago and since then some methods 
that we use have been deprecated in favor of new additions, and a new API 
({{numpy.typing}}), which is of some interest to us, has been added.

We should update minimum version requirement to one of the following

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.



  was:
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released almost 9 years ago and since then some methods 
that we use have been deprecated in favor of new additions and  anew API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.




> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we set the numpy version in {{extras_require}} to {{>=1.7}}.
> However, 1.7 was released almost 9 years ago and since then some methods 
> that we use have been deprecated in favor of new additions, and a new API 
> ({{numpy.typing}}), which is of some interest to us, has been added.
> We should update minimum version requirement to one of the following
> - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.
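
For context, a small sketch (not from the ticket) of the two API points 
mentioned: tobytes() replaces the deprecated tostring(), and numpy.typing 
(available from 1.20) provides aliases such as ArrayLike:

{code:python}
import numpy as np
import numpy.typing as npt  # requires numpy >= 1.20


def vector_to_bytes(values: npt.ArrayLike) -> bytes:
    """Serialize a 1-D array; tobytes() replaces the deprecated tostring()."""
    arr = np.asarray(values, dtype=np.float64)
    return arr.tobytes()


print(len(vector_to_bytes([1.0, 2.0, 3.0])))  # 24 bytes: three float64 values
{code}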



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38082:
---
Description: 
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released almost 9 years old and since then, some methods 
that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.



  was:
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released over almost 9 years old and since then, some 
methods that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.




> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 has been released almost 9 years old and since then, some 
> methods that we use have been deprecated in favor of new additions, and new 
> API ({{numpy.typing}}, that is of some interest to us, has been added.
> We should update minimum version requirement to:
> - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38082:
---
Description: 
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released almost 9 years ago and since then some methods 
that we use have been deprecated in favor of new additions and  anew API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.



  was:
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released almost 9 years ago and since then some methods 
that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.




> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 has been released almost 9 years ago and since then some methods 
> that we use have been deprecated in favor of new additions and  anew API 
> ({{numpy.typing}}, that is of some interest to us, has been added.
> We should update minimum version requirement to:
> - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38082:
---
Description: 
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released almost 9 years ago and since then some methods 
that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.



  was:
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released almost 9 years old and since then, some methods 
that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.




> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 has been released almost 9 years ago and since then some methods 
> that we use have been deprecated in favor of new additions, and new API 
> ({{numpy.typing}}, that is of some interest to us, has been added.
> We should update minimum version requirement to:
> - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485327#comment-17485327
 ] 

Maciej Szymkiewicz commented on SPARK-38082:


cc [~hyukjin.kwon] [~WeichenXu123] [~huaxingao]

> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 has been released over almost 9 years old and since then, some 
> methods that we use have been deprecated in favor of new additions, and new 
> API ({{numpy.typing}}, that is of some interest to us, has been added.
> We should update minimum version requirement to:
> - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38082:
---
Description: 
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released over almost 9 years old and since then, some 
methods that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.



  was:
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released over almost 9 years old and since then, some 
methods that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 should require much 
discussion.




> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 has been released over almost 9 years old and since then, some 
> methods that we use have been deprecated in favor of new additions, and new 
> API ({{numpy.typing}}, that is of some interest to us, has been added.
> We should update minimum version requirement to:
> - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38082:
---
Description: 
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released over almost 9 years old and since then, some 
methods that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.



  was:
Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released over almost 9 years old and since then, some 
methods that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 shouldn't require much 
discussion.




> Update minimum numpy version
> 
>
> Key: SPARK-38082
> URL: https://issues.apache.org/jira/browse/SPARK-38082
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.
> However, 1.7 has been released over almost 9 years old and since then, some 
> methods that we use have been deprecated in favor of new additions, and new 
> API ({{numpy.typing}}, that is of some interest to us, has been added.
> We should update minimum version requirement to:
> - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to 
> replace deprecated {{tostring}} calls with {{tobytes}}.
> - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our 
> minimum supported pandas version.
> - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.
> The last one might be somewhat controversial, but 1.15 shouldn't require much 
> discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38082) Update minimum numpy version

2022-02-01 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-38082:
--

 Summary: Update minimum numpy version
 Key: SPARK-38082
 URL: https://issues.apache.org/jira/browse/SPARK-38082
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 3.3.0
Reporter: Maciej Szymkiewicz


Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}.

However, 1.7 has been released over almost 9 years old and since then, some 
methods that we use have been deprecated in favor of new additions, and new API 
({{numpy.typing}}, that is of some interest to us, has been added.

We should update minimum version requirement to:

- {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace 
deprecated {{tostring}} calls with {{tobytes}}.
- {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our 
minimum supported pandas version.
- {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing.

The last one might be somewhat controversial, but 1.15 should require much 
discussion.





--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py

2022-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485313#comment-17485313
 ] 

Apache Spark commented on SPARK-37417:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35380

> Inline type hints for python/pyspark/ml/linalg/__init__.py
> --
>
> Key: SPARK-37417
> URL: https://issues.apache.org/jira/browse/SPARK-37417
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/linalg/__init__.pyi to 
> python/pyspark/ml/linalg/__init__.py.
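
A generic illustration of what inlining means here (the class and signature 
shown are hypothetical, not copied from the real linalg module): the annotation 
moves from the .pyi stub into the .py source itself.

{code:python}
# Before: the stub python/pyspark/ml/linalg/__init__.pyi declares, e.g.
#     def dot(self, other: Iterable[float]) -> np.float64: ...
# After: the same annotation is written inline in __init__.py.
from typing import Iterable

import numpy as np


class DenseVectorSketch:
    """Toy stand-in for a vector class, used only to show inline hints."""

    def __init__(self, values: Iterable[float]) -> None:
        self._arr = np.asarray(list(values), dtype=np.float64)

    def dot(self, other: Iterable[float]) -> np.float64:
        other_arr = np.asarray(list(other), dtype=np.float64)
        return np.float64(np.dot(self._arr, other_arr))
{code}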



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py

2022-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37417:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/linalg/__init__.py
> --
>
> Key: SPARK-37417
> URL: https://issues.apache.org/jira/browse/SPARK-37417
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/linalg/__init__.pyi to 
> python/pyspark/ml/linalg/__init__.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py

2022-02-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37417:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/ml/linalg/__init__.py
> --
>
> Key: SPARK-37417
> URL: https://issues.apache.org/jira/browse/SPARK-37417
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/linalg/__init__.pyi to 
> python/pyspark/ml/linalg/__init__.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py

2022-02-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485312#comment-17485312
 ] 

Apache Spark commented on SPARK-37417:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35380

> Inline type hints for python/pyspark/ml/linalg/__init__.py
> --
>
> Key: SPARK-37417
> URL: https://issues.apache.org/jira/browse/SPARK-37417
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/linalg/__init__.pyi to 
> python/pyspark/ml/linalg/__init__.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-02-01 Thread Cheng Lian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485215#comment-17485215
 ] 

Cheng Lian commented on SPARK-37980:


[~prakharjain09], as you've mentioned, it's not super straightforward to 
customize the Parquet code paths in Spark to achieve the goal. In the 
meanwhile, this functionality is in general quite useful. I can imagine it 
enabling other systems in the Parquet ecosystem to build more sophisticated 
indexing solutions. Instead of doing heavy customizations in Spark, would it be 
better if we can make the changes happen in upstream {{parquet-mr}} so that 
other systems can benefit from it more easily?

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3
>Reporter: Prakhar Jain
>Priority: Major
>
> Spark recently added hidden metadata column support for File based 
> datasources as part of  SPARK-37273.
> We should extend it to support ROW_INDEX/ROW_POSITION also.
>  
> Meaning of  ROW_POSITION:
> ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. 5th 
> row in the file will have ROW_INDEX 5.
>  
> Use cases: 
> Row Indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies row in a table. This information can be used to mark rows 
> e.g. this can be used by indexer etc.
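
For reference, a sketch of how the hidden metadata column is read today, with 
the proposed row index shown only as a hypothetical extra field (field name 
illustrative; requires a Spark version that exposes the _metadata column):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/tmp/events")  # illustrative path

# Fields that exist today under the hidden _metadata column (SPARK-37273).
df.select("*", "_metadata.file_name", "_metadata.file_path").show(truncate=False)

# The proposal would add something like a row index so that
# (file_name, row_index) uniquely identifies a row. The field name below is
# hypothetical and only illustrates the intended usage:
# df.select("_metadata.file_name", "_metadata.row_index")
{code}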



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-38067:
--

Assignee: Bjørn Jørgensen

> Inconsistent missing values handling in Pandas on Spark to_json
> ---
>
> Key: SPARK-38067
> URL: https://issues.apache.org/jira/browse/SPARK-38067
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> If {{ps.DataFrame.to_json}} is called without {{path}} argument, missing 
> values are written explicitly 
> {code:python}
> import pandas as pd
> import pyspark.pandas as ps
> pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
> psf = ps.from_pandas(pdf)
> psf.to_json()
> ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
> {code}
> This behavior is consistent with Pandas:
> {code:python}
> pdf.to_json()
> ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
> {code}
> However, if {{path}} is provided, missing values are omitted by default:
> {code:python}
> import tempfile
> path = tempfile.mktemp()
> psf.to_json(path)
> spark.read.text(path).show()
> ## ++
> ## |   value|
> ## ++
> ## |{"id":2,"value":3.0}|
> ## |{"id":3}|
> ## |{"id":1}|
> ## ++
> {code}
> We should set {{ignoreNullFields}} for Pandas API, to be `False` by default, 
> so both cases handle missing values in the same way.
> {code:python}
> psf.to_json(path, ignoreNullFields=False)
> spark.read.text(path).show(truncate=False)
> ## +-+
> ## |value|
> ## +-+
> ## |{"id":3,"value":null}|
> ## |{"id":1,"value":null}|
> ## |{"id":2,"value":3.0} |
> ## +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-38067.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35296
[https://github.com/apache/spark/pull/35296]

> Inconsistent missing values handling in Pandas on Spark to_json
> ---
>
> Key: SPARK-38067
> URL: https://issues.apache.org/jira/browse/SPARK-38067
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.3.0
>
>
> If {{ps.DataFrame.to_json}} is called without {{path}} argument, missing 
> values are written explicitly 
> {code:python}
> import pandas as pd
> import pyspark.pandas as ps
> pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
> psf = ps.from_pandas(pdf)
> psf.to_json()
> ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
> {code}
> This behavior is consistent with Pandas:
> {code:python}
> pdf.to_json()
> ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
> {code}
> However, if {{path}} is provided, missing values are omitted by default:
> {code:python}
> import tempfile
> path = tempfile.mktemp()
> psf.to_json(path)
> spark.read.text(path).show()
> ## ++
> ## |   value|
> ## ++
> ## |{"id":2,"value":3.0}|
> ## |{"id":3}|
> ## |{"id":1}|
> ## ++
> {code}
> We should set {{ignoreNullFields}} for Pandas API, to be `False` by default, 
> so both cases handle missing values in the same way.
> {code:python}
> psf.to_json(path, ignoreNullFields=False)
> spark.read.text(path).show(truncate=False)
> ## +-+
> ## |value|
> ## +-+
> ## |{"id":3,"value":null}|
> ## |{"id":1,"value":null}|
> ## |{"id":2,"value":3.0} |
> ## +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38067:
---
Description: 
If {{ps.DataFrame.to_json}} is called without {{path}} argument, missing values 
are written explicitly 

{code:python}
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
psf = ps.from_pandas(pdf)
psf.to_json()
## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
{code}

This behavior is consistent with Pandas:

{code:python}
pdf.to_json()
## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
{code}

However, if {{path}} is provided, missing values are omitted by default:


{code:python}
import tempfile

path = tempfile.mktemp()
psf.to_json(path)

spark.read.text(path).show()
## ++
## |   value|
## ++
## |{"id":2,"value":3.0}|
## |{"id":3}|
## |{"id":1}|
## ++
{code}


We should set {{ignoreNullFields}} for Pandas API, to be `False` by default, so 
both cases handle missing values in the same way.


{code:python}
psf.to_json(path, ignoreNullFields=False)
spark.read.text(path).show(truncate=False)


## +-+
## |value|
## +-+
## |{"id":3,"value":null}|
## |{"id":1,"value":null}|
## |{"id":2,"value":3.0} |
## +-+
{code}




  was:
With pandas

{code:java}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_pd = pd.DataFrame.from_dict(data)
test_pd.shape

{code}
(4, 2)


{code:java}
test_pd.to_json("testpd.json")

test_pd2 = pd.read_json("testpd.json")
test_pd2.shape

{code}
(4, 2)

Pandas on spark API does delete the column that has all values Null.

{code:java}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_ps = ps.DataFrame.from_dict(data)
test_ps.shape

{code}
(4, 2)


{code:java}
test_ps.to_json("testps.json")
test_ps2 = ps.read_json("testps.json/*")
test_ps2.shape

{code}
(4, 1)

We need to change this to make pandas on spark API be more like pandas.

I have opened a PR for this.





> Inconsistent missing values handling in Pandas on Spark to_json
> ---
>
> Key: SPARK-38067
> URL: https://issues.apache.org/jira/browse/SPARK-38067
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> If {{ps.DataFrame.to_json}} is called without {{path}} argument, missing 
> values are written explicitly 
> {code:python}
> import pandas as pd
> import pyspark.pandas as ps
> pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
> psf = ps.from_pandas(pdf)
> psf.to_json()
> ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
> {code}
> This behavior is consistent with Pandas:
> {code:python}
> pdf.to_json()
> ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
> {code}
> However, if {{path}} is provided, missing values are omitted by default:
> {code:python}
> import tempfile
> path = tempfile.mktemp()
> psf.to_json(path)
> spark.read.text(path).show()
> ## ++
> ## |   value|
> ## ++
> ## |{"id":2,"value":3.0}|
> ## |{"id":3}|
> ## |{"id":1}|
> ## ++
> {code}
> We should set {{ignoreNullFields}} for Pandas API, to be `False` by default, 
> so both cases handle missing values in the same way.
> {code:python}
> psf.to_json(path, ignoreNullFields=False)
> spark.read.text(path).show(truncate=False)
> ## +-+
> ## |value|
> ## +-+
> ## |{"id":3,"value":null}|
> ## |{"id":1,"value":null}|
> ## |{"id":2,"value":3.0} |
> ## +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json

2022-02-01 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38067:
---
Summary: Inconsistent missing values handling in Pandas on Spark to_json  
(was: Pandas on spark deletes columns with all None as default.)

> Inconsistent missing values handling in Pandas on Spark to_json
> ---
>
> Key: SPARK-38067
> URL: https://issues.apache.org/jira/browse/SPARK-38067
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With pandas
> {code:java}
> data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
> test_pd = pd.DataFrame.from_dict(data)
> test_pd.shape
> {code}
> (4, 2)
> {code:java}
> test_pd.to_json("testpd.json")
> test_pd2 = pd.read_json("testpd.json")
> test_pd2.shape
> {code}
> (4, 2)
> Pandas on spark API does delete the column that has all values Null.
> {code:java}
> data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
> test_ps = ps.DataFrame.from_dict(data)
> test_ps.shape
> {code}
> (4, 2)
> {code:java}
> test_ps.to_json("testps.json")
> test_ps2 = ps.read_json("testps.json/*")
> test_ps2.shape
> {code}
> (4, 1)
> We need to change this to make pandas on spark API be more like pandas.
> I have opened a PR for this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org