[jira] [Assigned] (SPARK-38076) Remove redundant null-check which is covered by further condition
[ https://issues.apache.org/jira/browse/SPARK-38076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38076: - Assignee: Yang Jie > Remove redundant null-check which is covered by further condition > --- > > Key: SPARK-38076 > URL: https://issues.apache.org/jira/browse/SPARK-38076 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > There are many code patterns of the following form: > {code:java} > obj != null && obj instanceof SomeClass {code} > the null-check is redundant, as the instanceof operator implies non-nullity. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38076) Remove redundant null-check which is covered by further condition
[ https://issues.apache.org/jira/browse/SPARK-38076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38076. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35369 [https://github.com/apache/spark/pull/35369] > Remove redundant null-check which is covered by further condition > --- > > Key: SPARK-38076 > URL: https://issues.apache.org/jira/browse/SPARK-38076 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > There are many code patterns of the following form: > {code:java} > obj != null && obj instanceof SomeClass {code} > the null-check is redundant, as the instanceof operator implies non-nullity.
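For illustration, a minimal standalone sketch of the simplification this issue describes (the class and values here are illustrative stand-ins for `SomeClass`, not code from the Spark patch):

```java
public class NullCheckDemo {
    // The pattern flagged by the issue: an explicit null-check followed
    // by instanceof.
    static boolean redundant(Object obj) {
        return obj != null && obj instanceof String;
    }

    // The simplified form: instanceof already evaluates to false for a
    // null operand, so the null-check adds nothing.
    static boolean simplified(Object obj) {
        return obj instanceof String;
    }

    public static void main(String[] args) {
        // Both forms agree for null and non-null inputs; prints true three times.
        for (Object obj : new Object[] {null, "spark", 42}) {
            System.out.println(redundant(obj) == simplified(obj));
        }
    }
}
```

Since the two forms are equivalent for every input, the rewrite is purely mechanical, which is why the change could be applied across the codebase in one pass.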
[jira] [Assigned] (SPARK-38080) Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
[ https://issues.apache.org/jira/browse/SPARK-38080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38080: - Assignee: Shixiong Zhu > Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and > resetTerminated' > -- > > Key: SPARK-38080 > URL: https://issues.apache.org/jira/browse/SPARK-38080 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 3.2.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > {code:java} > [info] - awaitAnyTermination with timeout and resetTerminated *** FAILED *** > (14 seconds, 304 milliseconds) > [info] Did not throw SparkException when expected. Expected exception > org.apache.spark.sql.streaming.StreamingQueryException to be thrown, but > org.scalatest.exceptions.TestFailedException was thrown (StreamTest.scala:935) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) > [info] at org.scalatest.Assertions.intercept(Assertions.scala:756) > [info] at org.scalatest.Assertions.intercept$(Assertions.scala:746) > [info] at > org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563) > [info] at > org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.$anonfun$test$4(StreamTest.scala:935) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563) > [info] at > org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.test(StreamTest.scala:935) > [info] at > 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite.testAwaitAnyTermination(StreamingQueryManagerSuite.scala:445) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10(StreamingQueryManagerSuite.scala:221) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10$adapted(StreamingQueryManagerSuite.scala:140) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$withQueriesOn$1(StreamingQueryManagerSuite.scala:421) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127) > [info] at > org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239) > [info] at > org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfterImpl(StreamingQueryManagerSuite.scala:39) > [info] at > org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230) > [info] at > org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.withQueriesOn(StreamingQueryManagerSuite.scala:397) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$8(StreamingQueryManagerSuite.scala:140) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) > [info] at > org.apache.spark.sql.test.SQLTestUtils.$anonfun$testQuietly$1(SQLTestUtils.scala:115) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > 
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) >
[jira] [Resolved] (SPARK-38080) Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
[ https://issues.apache.org/jira/browse/SPARK-38080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38080. --- Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 35372 [https://github.com/apache/spark/pull/35372] > Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and > resetTerminated' > -- > > Key: SPARK-38080 > URL: https://issues.apache.org/jira/browse/SPARK-38080 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 3.2.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > {code:java} > [info] - awaitAnyTermination with timeout and resetTerminated *** FAILED *** > (14 seconds, 304 milliseconds) > [info] Did not throw SparkException when expected. Expected exception > org.apache.spark.sql.streaming.StreamingQueryException to be thrown, but > org.scalatest.exceptions.TestFailedException was thrown (StreamTest.scala:935) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) > [info] at org.scalatest.Assertions.intercept(Assertions.scala:756) > [info] at org.scalatest.Assertions.intercept$(Assertions.scala:746) > [info] at > org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563) > [info] at > org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.$anonfun$test$4(StreamTest.scala:935) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563) > [info] at > 
org.apache.spark.sql.streaming.StreamTest$AwaitTerminationTester$.test(StreamTest.scala:935) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.testAwaitAnyTermination(StreamingQueryManagerSuite.scala:445) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10(StreamingQueryManagerSuite.scala:221) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$10$adapted(StreamingQueryManagerSuite.scala:140) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$withQueriesOn$1(StreamingQueryManagerSuite.scala:421) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127) > [info] at > org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239) > [info] at > org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfterImpl(StreamingQueryManagerSuite.scala:39) > [info] at > org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230) > [info] at > org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.withQueriesOn(StreamingQueryManagerSuite.scala:397) > [info] at > org.apache.spark.sql.streaming.StreamingQueryManagerSuite.$anonfun$new$8(StreamingQueryManagerSuite.scala:140) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) > [info] at > org.apache.spark.sql.test.SQLTestUtils.$anonfun$testQuietly$1(SQLTestUtils.scala:115) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at 
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
[jira] [Resolved] (SPARK-37397) Inline type hints for python/pyspark/ml/base.py
[ https://issues.apache.org/jira/browse/SPARK-37397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37397. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35289 [https://github.com/apache/spark/pull/35289] > Inline type hints for python/pyspark/ml/base.py > --- > > Key: SPARK-37397 > URL: https://issues.apache.org/jira/browse/SPARK-37397 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/base.pyi to > python/pyspark/ml/base.py.
[jira] [Assigned] (SPARK-37397) Inline type hints for python/pyspark/ml/base.py
[ https://issues.apache.org/jira/browse/SPARK-37397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37397: - Assignee: Maciej Szymkiewicz > Inline type hints for python/pyspark/ml/base.py > --- > > Key: SPARK-37397 > URL: https://issues.apache.org/jira/browse/SPARK-37397 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/base.pyi to > python/pyspark/ml/base.py.
[jira] [Updated] (SPARK-38013) AQE can change bhj to smj if no extra shuffle is introduced
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38013: -- Component/s: Tests > AQE can change bhj to smj if no extra shuffle is introduced > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > An example to reproduce the bug: > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a bhj is converted to smj/shj without > introducing an extra shuffle, and AQE does not think the join can be planned as > a bhj.
[jira] [Reopened] (SPARK-38013) AQE can change bhj to smj if no extra shuffle is introduced
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-38013: --- > AQE can change bhj to smj if no extra shuffle is introduced > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > An example to reproduce the bug: > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a bhj is converted to smj/shj without > introducing an extra shuffle, and AQE does not think the join can be planned as > a bhj.
[jira] [Resolved] (SPARK-38013) AQE can change bhj to smj if no extra shuffle is introduced
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38013. --- Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed This JIRA added new test coverage. > AQE can change bhj to smj if no extra shuffle is introduced > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > An example to reproduce the bug: > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a bhj is converted to smj/shj without > introducing an extra shuffle, and AQE does not think the join can be planned as > a bhj.
[jira] [Assigned] (SPARK-38013) AQE can change bhj to smj if no extra shuffle is introduced
[ https://issues.apache.org/jira/browse/SPARK-38013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38013: - Assignee: XiDuo You > AQE can change bhj to smj if no extra shuffle is introduced > --- > > Key: SPARK-38013 > URL: https://issues.apache.org/jira/browse/SPARK-38013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > > An example to reproduce the bug: > {code:java} > create table t1 as select 1 c1, 2 c2; > create table t2 as select 1 c1, 2 c2; > create table t3 as select 1 c1, 2 c2; > set spark.sql.adaptive.autoBroadcastJoinThreshold=-1; > select /*+ merge(t3) */ * from t1 > left join ( > select c1 as c from t3 > ) t3 on t1.c1 = t3.c > left join ( > select /*+ repartition(c1) */ c1 from t2 > ) t2 on t1.c1 = t2.c1; > {code} > The key to reproducing this bug is that a bhj is converted to smj/shj without > introducing an extra shuffle, and AQE does not think the join can be planned as > a bhj.
[jira] [Updated] (SPARK-38087) select doesn't validate if the column already exists
[ https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepa Vasanthkumar updated SPARK-38087: --- Description: Select doesn't validate whether the alias column is already present in the dataframe. After that, we cannot do anything with that column in the dataframe. df4 = df2.select(df2.firstname, df2.lastname) --> throws analysis exception df4.show() However drop will not let you drop the said column. Scenario to reproduce : df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will add the same column again df2.show() df2.drop(df2.firstname) --> this will give AnalysisException: Reference 'firstname' is ambiguous, could be: firstname, firstname. Is this expected behavior? !select vs drop.png! !image-2022-02-02-06-28-23-543.png! was: Select doesn't validate whether the alias column is already present in the dataframe. However drop will not let you drop the said column. Scenario to reproduce : df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will add the same column again df2.show() df2.drop(df2.firstname) --> this will give AnalysisException: Reference 'firstname' is ambiguous, could be: firstname, firstname. Is this expected behavior? !select vs drop.png! !image-2022-02-02-06-28-23-543.png! > select doesn't validate if the column already exists > --- > > Key: SPARK-38087 > URL: https://issues.apache.org/jira/browse/SPARK-38087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 > Environment: Version: {{v3.2.1}} > Master: {{local[*]}} > {{(Reproducible in any environment)}} >Reporter: Deepa Vasanthkumar >Priority: Minor > Fix For: 3.3 > > Attachments: select vs drop.png > > > > Select doesn't validate whether the alias column is already present in the > dataframe. > After that, we cannot do anything with that column in the dataframe. > df4 = df2.select(df2.firstname, df2.lastname) --> throws analysis exception > df4.show() > > However drop will not let you drop the said column. > > Scenario to reproduce : > df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will > add the same column again > df2.show() > df2.drop(df2.firstname) --> this will give AnalysisException: Reference > 'firstname' is ambiguous, could be: firstname, firstname. > > > Is this expected behavior? > !select vs drop.png! > !image-2022-02-02-06-28-23-543.png!
[jira] [Updated] (SPARK-38087) select doesn't validate if the column already exists
[ https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepa Vasanthkumar updated SPARK-38087: --- Attachment: select vs drop.png > select doesn't validate if the column already exists > --- > > Key: SPARK-38087 > URL: https://issues.apache.org/jira/browse/SPARK-38087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 > Environment: Version: {{v3.2.1}} > Master: {{local[*]}} > {{(Reproducible in any environment)}} >Reporter: Deepa Vasanthkumar >Priority: Minor > Fix For: 3.3 > > Attachments: select vs drop.png > > > > Select doesn't validate whether the alias column is already present in the > dataframe. > However drop will not let you drop the said column. > > Scenario to reproduce : > df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will > add the same column again > df2.show() > df2.drop(df2.firstname) --> this will give AnalysisException: Reference > 'firstname' is ambiguous, could be: firstname, firstname. > > Is this expected behavior? > > > !image-2022-02-02-06-28-23-543.png!
[jira] [Updated] (SPARK-38087) select doesn't validate if the column already exists
[ https://issues.apache.org/jira/browse/SPARK-38087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepa Vasanthkumar updated SPARK-38087: --- Description: Select doesn't validate whether the alias column is already present in the dataframe. However drop will not let you drop the said column. Scenario to reproduce : df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will add the same column again df2.show() df2.drop(df2.firstname) --> this will give AnalysisException: Reference 'firstname' is ambiguous, could be: firstname, firstname. Is this expected behavior? !select vs drop.png! !image-2022-02-02-06-28-23-543.png! was: Select doesn't validate whether the alias column is already present in the dataframe. However drop will not let you drop the said column. Scenario to reproduce : df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will add the same column again df2.show() df2.drop(df2.firstname) --> this will give AnalysisException: Reference 'firstname' is ambiguous, could be: firstname, firstname. Is this expected behavior? !image-2022-02-02-06-28-23-543.png! > select doesn't validate if the column already exists > --- > > Key: SPARK-38087 > URL: https://issues.apache.org/jira/browse/SPARK-38087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 > Environment: Version: {{v3.2.1}} > Master: {{local[*]}} > {{(Reproducible in any environment)}} >Reporter: Deepa Vasanthkumar >Priority: Minor > Fix For: 3.3 > > Attachments: select vs drop.png > > > > Select doesn't validate whether the alias column is already present in the > dataframe. > However drop will not let you drop the said column. > > Scenario to reproduce : > df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will > add the same column again > df2.show() > df2.drop(df2.firstname) --> this will give AnalysisException: Reference > 'firstname' is ambiguous, could be: firstname, firstname. > > > Is this expected behavior? > !select vs drop.png! > !image-2022-02-02-06-28-23-543.png!
[jira] [Created] (SPARK-38087) select doesn't validate if the column already exists
Deepa Vasanthkumar created SPARK-38087: -- Summary: select doesn't validate if the column already exists Key: SPARK-38087 URL: https://issues.apache.org/jira/browse/SPARK-38087 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.1 Environment: Version: {{v3.2.1}} Master: {{local[*]}} {{(Reproducible in any environment)}} Reporter: Deepa Vasanthkumar Fix For: 3.3 Select doesn't validate whether the alias column is already present in the dataframe. However drop will not let you drop the said column. Scenario to reproduce : df2 = df1.select("*", (df1.firstname).alias("firstname")) ---> this will add the same column again df2.show() df2.drop(df2.firstname) --> this will give AnalysisException: Reference 'firstname' is ambiguous, could be: firstname, firstname. Is this expected behavior? !image-2022-02-02-06-28-23-543.png!
[jira] [Comment Edited] (SPARK-25789) Support for Dataset of Avro
[ https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485580#comment-17485580 ] Prashant Pandey edited comment on SPARK-25789 at 2/2/22, 5:51 AM: -- Is there any update on this ticket? We are facing this problem trying to use an Avro generated class to create a Dataset from a Dataframe. "{{{}UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema"{}}} [https://stackoverflow.com/questions/70950967/circular-reference-in-bean-class-while-creating-a-dataset-from-an-avro-generated] was (Author: pandepra): Is there any update on this ticket? We are facing this problem trying to use an Avro generated class to create a Dataset from a Dataframe. https://stackoverflow.com/questions/70950967/circular-reference-in-bean-class-while-creating-a-dataset-from-an-avro-generated > Support for Dataset of Avro > --- > > Key: SPARK-25789 > URL: https://issues.apache.org/jira/browse/SPARK-25789 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aleksander Eskilson >Priority: Major > > Support for Dataset of Avro records in an API that would allow the user to > provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} > encoder. This functionality was previously to be provided by SPARK-22739 and > [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro > functionality was folded into Spark-proper by SPARK-24768, eliminating the > need to maintain a separate library for Avro in Spark. Resolution of this > issue would: > * Add necessary {{Expression}} elements to Spark > * Add an {{AvroEncoder}} for Datasets of Avro records to Spark
[jira] [Comment Edited] (SPARK-25789) Support for Dataset of Avro
[ https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485580#comment-17485580 ] Prashant Pandey edited comment on SPARK-25789 at 2/2/22, 5:50 AM: -- Is there any update on this ticket? We are facing this problem trying to use an Avro generated class to create a Dataset from a Dataframe. https://stackoverflow.com/questions/70950967/circular-reference-in-bean-class-while-creating-a-dataset-from-an-avro-generated was (Author: pandepra): Is there any update on this ticket? We are facing this problem trying to use an Avro generated class to create a Dataset from a Dataframe. > Support for Dataset of Avro > --- > > Key: SPARK-25789 > URL: https://issues.apache.org/jira/browse/SPARK-25789 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aleksander Eskilson >Priority: Major > > Support for Dataset of Avro records in an API that would allow the user to > provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} > encoder. This functionality was previously to be provided by SPARK-22739 and > [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro > functionality was folded into Spark-proper by SPARK-24768, eliminating the > need to maintain a separate library for Avro in Spark. Resolution of this > issue would: > * Add necessary {{Expression}} elements to Spark > * Add an {{AvroEncoder}} for Datasets of Avro records to Spark
[jira] [Commented] (SPARK-25789) Support for Dataset of Avro
[ https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485580#comment-17485580 ] Prashant Pandey commented on SPARK-25789: - Is there any update on this ticket? We are facing this problem trying to use an Avro generated class to create a Dataset from a Dataframe. > Support for Dataset of Avro > --- > > Key: SPARK-25789 > URL: https://issues.apache.org/jira/browse/SPARK-25789 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aleksander Eskilson >Priority: Major > > Support for Dataset of Avro records in an API that would allow the user to > provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} > encoder. This functionality was previously to be provided by SPARK-22739 and > [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro > functionality was folded into Spark-proper by SPARK-24768, eliminating the > need to maintain a separate library for Avro in Spark. Resolution of this > issue would: > * Add necessary {{Expression}} elements to Spark > * Add an {{AvroEncoder}} for Datasets of Avro records to Spark
[jira] [Updated] (SPARK-38078) Aggregation with Watermark in AppendMode is holding data beyond the watermark boundary
[ https://issues.apache.org/jira/browse/SPARK-38078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] krishna updated SPARK-38078: Description: I am struggling with a unique issue. I am not sure if my understanding is wrong or this is a bug in Spark. # I am reading a stream from events hub/kafka ( Extract) # Pivoting and Aggregating the above dataframe ( Transformation). This is a WATERMARKED aggregation. # writing the aggregation to Console/Delta table in APPEND mode with a Trigger . However, the most recently published message to event hub is not written to console/delta even after falling out of the watermark time. My understanding is the event should be inserted into the Delta table after event time + watermark. Moreover, all the events stored in memory must be flushed out to the sink irrespective of the watermark before stopping, to mark a graceful shutdown. Please advise. was: I am struggling with a unique issue. I am not sure if my understanding is wrong or this is a bug in Spark. # I am reading a stream from events hub/kafka ( Extract) # Pivoting and Aggregating the above dataframe ( Transformation). This is a WATERMARKED aggregation. # writing the aggregation to Console/Delta table in APPEND mode with a Trigger . However, the most recently published message to event hub is not written to console/delta even after falling out of the watermark time. My understanding is the event should be inserted into the Delta table after event time + watermark. Please advise. > Aggregation with Watermark in AppendMode is holding data beyond the watermark > boundary > > > Key: SPARK-38078 > URL: https://issues.apache.org/jira/browse/SPARK-38078 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: krishna >Priority: Major > > I am struggling with a unique issue. I am not sure if my understanding is > wrong or this is a bug in Spark. 
> > # I am reading a stream from events hub/kafka ( Extract) > # Pivoting and Aggregating the above dataframe ( Transformation). This is a > WATERMARKED aggregation. > # writing the aggregation to Console/Delta table in APPEND mode with a > Trigger . > However, the most recently published message to event hub is not writing to > console/delta even after falling out of the watermark time. > > My understanding is the event should be inserted to the Delta table after > Eventtime+Watermark. > > Moreover, all the events in the memory stored must be flushed out to the sink > irrespective of the watermark before stopping to mark a graceful shutdown . > > Please advise. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
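The reporter's observation matches how append-mode watermarking works: a windowed aggregate is emitted only once the watermark (the maximum event time seen so far, minus the delay) passes the window's end, and the watermark only advances when a further batch runs. The pure-Python sketch below is an illustrative model, not Spark code; the function name and the simulation are assumptions made for clarity.

```python
from collections import defaultdict

def run_batches(batches, delay, window):
    """Toy model of a watermarked, tumbling-window count in append mode.

    batches: list of micro-batches, each a list of event times;
    delay: watermark delay; window: tumbling-window size.
    Returns (emitted_windows, windows_still_held_in_state).
    """
    state = defaultdict(int)   # window start -> running count
    watermark = 0
    emitted = []
    for events in batches:
        for t in events:
            state[(t // window) * window] += 1
        # The watermark advances only after a batch with data completes.
        if events:
            watermark = max(watermark, max(events) - delay)
        # Append mode emits a window only once it can no longer change,
        # i.e. its end time is at or before the watermark.
        for start in sorted(w for w in state if w + window <= watermark):
            emitted.append((start, state.pop(start)))
    return emitted, dict(state)

emitted, held = run_batches([[1, 2], [12], [25]], delay=5, window=10)
```

Here `emitted` is `[(0, 2), (10, 1)]` and `held` is `{20: 1}`: the event at time 25 stays in state indefinitely, because nothing arrives afterwards to advance the watermark past its window. That is the behaviour the reporter observes, and it is why no-data micro-batches (SPARK-24156) matter for eager state cleanup; on a plain stop, results still held in state are not emitted in append mode.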
[jira] [Commented] (SPARK-38059) Incorrect query ordering with flatMap() and distinct()
[ https://issues.apache.org/jira/browse/SPARK-38059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485550#comment-17485550 ] Vu Tan commented on SPARK-38059: I ran your above Java app and got the below result (on spark 3.2.0) {code:java} 0,6 0,7 1,6 1,7 {code} So I think it is working as expected. > Incorrect query ordering with flatMap() and distinct() > -- > > Key: SPARK-38059 > URL: https://issues.apache.org/jira/browse/SPARK-38059 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.0.2, 3.2.0 >Reporter: AJ Bousquet >Priority: Major > > I have a Dataset of non-unique identifiers that I can use with > {{Dataset::flatMap()}} to create multiple rows with sub-identifiers for each > id. When I run the code below, the {{limit(2)}} call is placed _after_ the > call to {{flatMap()}} in the optimized logical plan. This unexpectedly yields > only 2 rows, when I would expect it to yield 6. > {code:java} > StructType idSchema = > DataTypes.createStructType(List.of(DataTypes.createStructField("id", > DataTypes.LongType, false))); > StructType flatMapSchema = DataTypes.createStructType(List.of( > DataTypes.createStructField("id", DataTypes.LongType, false), > DataTypes.createStructField("subId", DataTypes.LongType, false) > ));Dataset inputDataset = context.sparkSession().createDataset( > LongStream.range(0,5).mapToObj((id) -> > RowFactory.create(id)).collect(Collectors.toList()), > RowEncoder.apply(idSchema) > ); > return inputDataset > .distinct() > .limit(2) > .flatMap((Row row) -> { > Long id = row.getLong(row.fieldIndex("id")); return > LongStream.range(6,8).mapToObj((subid) -> RowFactory.create(id, > subid)).iterator(); > }, RowEncoder.apply(flatMapSchema)); {code} > When run, the above code produces something like: > ||id||subID|| > |0|6| > |0|7| > But I would expect something like: > ||id||subID|| > |1|6| > |1|7| > |1|8| > |0|6| > |0|7| > |0|8| -- This message was sent by Atlassian Jira (v8.20.1#820001) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
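The commenter's four-row result follows directly from the plan order in the reported snippet: `distinct()` and `limit(2)` run before `flatMap()`, so only two ids ever reach the flatMap function. A pure-Python sketch of that pipeline (illustrative only, not the Spark API):

```python
def distinct_limit_flatmap(ids, n, f):
    """Mirrors inputDataset.distinct().limit(n).flatMap(f): the limit
    runs BEFORE flatMap, so at most n input rows ever reach f."""
    out = []
    # dict.fromkeys keeps first-seen order here for determinism;
    # Spark's distinct() makes no such ordering guarantee.
    for i, x in enumerate(dict.fromkeys(ids)):
        if i == n:
            break
        out.extend(f(x))
    return out

rows = distinct_limit_flatmap(range(5), 2,
                              lambda i: [(i, s) for s in range(6, 8)])
```

`rows` comes out as `[(0, 6), (0, 7), (1, 6), (1, 7)]`: two ids times two sub-ids. Note also that `LongStream.range(6, 8)` excludes 8, so even without the limit each id would yield two sub-rows, not the three shown in the reporter's expected table.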
[jira] [Commented] (SPARK-38086) Make ArrowColumnVector Extendable
[ https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485530#comment-17485530 ] Kazuyuki Tanimura commented on SPARK-38086: --- I am working on this > Make ArrowColumnVector Extendable > - > > Key: SPARK-38086 > URL: https://issues.apache.org/jira/browse/SPARK-38086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Some Spark extension libraries need to extend ArrowColumnVector.java. For > now, it is impossible as ArrowColumnVector class is final and the accessors > are all private. > For example, Rapids copies the entire ArrowColumnVector class in order to > work around the issue > [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java] > Proposing to relax private/final restrictions to make ArrowColumnVector > extendable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38086) Make ArrowColumnVector Extendable
Kazuyuki Tanimura created SPARK-38086: - Summary: Make ArrowColumnVector Extendable Key: SPARK-38086 URL: https://issues.apache.org/jira/browse/SPARK-38086 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Kazuyuki Tanimura Some Spark extension libraries need to extend ArrowColumnVector.java. For now, it is impossible as ArrowColumnVector class is final and the accessors are all private. For example, Rapids copies the entire ArrowColumnVector class in order to work around the issue [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java] Proposing to relax private/final restrictions to make ArrowColumnVector extendable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35801) SPIP: Row-level operations in Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485508#comment-17485508 ] L. C. Hsieh commented on SPARK-35801: - I think we can leave this open and put sub-tasks under this, like https://issues.apache.org/jira/browse/SPARK-34849. > SPIP: Row-level operations in Data Source V2 > > > Key: SPARK-35801 > URL: https://issues.apache.org/jira/browse/SPARK-35801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Priority: Major > Labels: SPIP > > Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more > important for modern Big Data workflows. Use cases include but are not > limited to deleting a set of records for regulatory compliance, updating a > set of records to fix an issue in the ingestion pipeline, applying changes in > a transaction log to a fact table. Row-level operations allow users to easily > express their use cases that would otherwise require much more SQL. Common > patterns for updating partitions are to read, union, and overwrite or read, > diff, and append. Using commands like MERGE, these operations are easier to > express and can be more efficient to run. > Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] > and Spark should implement similar support. > SPIP: > https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38047) Add OUTLIER_NO_FALLBACK executor roll policy
[ https://issues.apache.org/jira/browse/SPARK-38047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38047: -- Summary: Add OUTLIER_NO_FALLBACK executor roll policy (was: Provide an option to only roll executors if they are outliers) > Add OUTLIER_NO_FALLBACK executor roll policy > > > Key: SPARK-38047 > URL: https://issues.apache.org/jira/browse/SPARK-38047 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Alex Holmes >Assignee: Alex Holmes >Priority: Major > Fix For: 3.3.0 > > > Currently executor rolling will always kill one executor every > {{{}spark.kubernetes.executor.rollInterval{}}}. For some of the policies this > may not be optimal in cases where the executor metric isn't an outlier > compared to other executors. There is a cost associated with killing > executors (ramp-up time for new executors for example) which applications may > not want to incur for non-outlier executors. > > This ticket would add the ability to only kill executors if they are > outliers. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
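A minimal sketch of the OUTLIER_NO_FALLBACK idea in pure Python. This is illustrative only: the function name, the mean-plus-k-standard-deviations rule, and the default threshold are assumptions, not Spark's actual policy implementation.

```python
from statistics import mean, stdev

def pick_executor_to_roll(metrics, k=1.0):
    """Return the executor with the worst metric only if that metric is
    an outlier (more than k sample standard deviations above the mean);
    otherwise return None, meaning no executor is rolled this interval."""
    if len(metrics) < 2:
        return None
    values = list(metrics.values())
    mu, sigma = mean(values), stdev(values)
    worst = max(metrics, key=metrics.get)
    return worst if metrics[worst] > mu + k * sigma else None
```

With a clear outlier (`{'a': 1.0, 'b': 1.1, 'c': 9.0}`) this picks `'c'`; with tightly clustered metrics it returns `None`, avoiding the executor-replacement cost (ramp-up time for a new executor) the ticket describes.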
[jira] [Created] (SPARK-38085) DataSource V2: Handle DELETE commands for group-based sources
Anton Okolnychyi created SPARK-38085: Summary: DataSource V2: Handle DELETE commands for group-based sources Key: SPARK-38085 URL: https://issues.apache.org/jira/browse/SPARK-38085 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Anton Okolnychyi As per SPARK-35801, we should handle DELETE statements for sources that can replace groups of data (e.g. partitions, files). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35801) SPIP: Row-level operations in Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-35801: - Affects Version/s: 3.3.0 (was: 3.2.0) > SPIP: Row-level operations in Data Source V2 > > > Key: SPARK-35801 > URL: https://issues.apache.org/jira/browse/SPARK-35801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anton Okolnychyi >Priority: Major > Labels: SPIP > > Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more > important for modern Big Data workflows. Use cases include but are not > limited to deleting a set of records for regulatory compliance, updating a > set of records to fix an issue in the ingestion pipeline, applying changes in > a transaction log to a fact table. Row-level operations allow users to easily > express their use cases that would otherwise require much more SQL. Common > patterns for updating partitions are to read, union, and overwrite or read, > diff, and append. Using commands like MERGE, these operations are easier to > express and can be more efficient to run. > Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] > and Spark should implement similar support. > SPIP: > https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35801) SPIP: Row-level operations in Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485502#comment-17485502 ] Anton Okolnychyi commented on SPARK-35801: -- [~viirya], shall we keep this one open until the implementation is done or can we close it now? The community has already voted on this SPIP. > SPIP: Row-level operations in Data Source V2 > > > Key: SPARK-35801 > URL: https://issues.apache.org/jira/browse/SPARK-35801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Major > Labels: SPIP > > Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more > important for modern Big Data workflows. Use cases include but are not > limited to deleting a set of records for regulatory compliance, updating a > set of records to fix an issue in the ingestion pipeline, applying changes in > a transaction log to a fact table. Row-level operations allow users to easily > express their use cases that would otherwise require much more SQL. Common > patterns for updating partitions are to read, union, and overwrite or read, > diff, and append. Using commands like MERGE, these operations are easier to > express and can be more efficient to run. > Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] > and Spark should implement similar support. > SPIP: > https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35801) SPIP: Row-level operations in Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-35801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-35801: - Labels: SPIP (was: ) > SPIP: Row-level operations in Data Source V2 > > > Key: SPARK-35801 > URL: https://issues.apache.org/jira/browse/SPARK-35801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Major > Labels: SPIP > > Row-level operations such as UPDATE, DELETE, MERGE are becoming more and more > important for modern Big Data workflows. Use cases include but are not > limited to deleting a set of records for regulatory compliance, updating a > set of records to fix an issue in the ingestion pipeline, applying changes in > a transaction log to a fact table. Row-level operations allow users to easily > express their use cases that would otherwise require much more SQL. Common > patterns for updating partitions are to read, union, and overwrite or read, > diff, and append. Using commands like MERGE, these operations are easier to > express and can be more efficient to run. > Hive supports [MERGE|https://blog.cloudera.com/update-hive-tables-easy-way/] > and Spark should implement similar support. > SPIP: > https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38084. --- Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 35381 [https://github.com/apache/spark/pull/35381] > Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py` > > > Key: SPARK-38084 > URL: https://issues.apache.org/jira/browse/SPARK-38084 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38084: - Assignee: Dongjoon Hyun > Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py` > > > Key: SPARK-38084 > URL: https://issues.apache.org/jira/browse/SPARK-38084 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38075) Hive script transform with order by and limit will return fake rows
[ https://issues.apache.org/jira/browse/SPARK-38075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-38075: - Fix Version/s: 3.1.3 (was: 3.1.4) > Hive script transform with order by and limit will return fake rows > --- > > Key: SPARK-38075 > URL: https://issues.apache.org/jira/browse/SPARK-38075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > Fix For: 3.1.3, 3.3.0, 3.2.2 > > > For example: > {noformat} > create or replace temp view t as > select * from values > (1), > (2), > (3) > as t(a); > select transform(a) > USING 'cat' AS (a int) > FROM t order by a limit 10; > {noformat} > This returns: > {noformat} > NULL > NULL > NULL > 1 > 2 > 3 > {noformat} > Without {{order by}} and {{limit}}, the query returns: > {noformat} > 1 > 2 > 3 > {noformat} > Spark script transform does not have this issue. That is, if > {{spark.sql.catalogImplementation=in-memory}}, Spark does not return fake > rows. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up
[ https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483838#comment-17483838 ] krishna edited comment on SPARK-24156 at 2/1/22, 9:04 PM: -- Hi [~kcsrms] [~tdas] , I am having the same issue. Is this issue resolved? Is there a specific version I need to choose? I am struggling with a unique issue. I am not sure if my understanding is wrong or this is a bug with spark. # I am reading a stream from events hub ( Extract) # Pivoting and Aggregating the above dataframe ( Transformation). This is a WATERMARKED aggregation. # writing the aggregation to Delta table in APPEND mode with a Trigger . However, the most recently published message to event hub is not writing to delta even after falling out of the watermark time. My understanding is the data should be inserted to the Delta table after Eventtime+Watermark. Moreover, all the events in the memory stored must be flushed out to the sink before stopping to mark a graceful shutdown. was (Author: JIRAUSER284389): Hi [~kcsrms] [~tdas] , I am having the same issue. Is this issue resolved? Is there a specific version I need to choose? I am struggling with a unique issue. I am not sure if my understanding is wrong or this is a bug with spark. # I am reading a stream from events hub ( Extract) # Pivoting and Aggregating the above dataframe ( Transformation). This is a WATERMARKED aggregation. # writing the aggregation to Delta table in APPEND mode with a Trigger . However, the most recently published message to event hub is not writing to delta even after falling out of the watermark time. My understanding is the data should be inserted to the Delta table after Eventtime+Watermark. 
> Enable no-data micro batches for more eager streaming state clean up > - > > Key: SPARK-24156 > URL: https://issues.apache.org/jira/browse/SPARK-24156 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 2.4.0 > > > Currently, MicroBatchExecution in Structured Streaming runs batches only when > there is new data to process. This is sensible in most cases as we dont want > to unnecessarily use resources when there is nothing new to process. However, > in some cases of stateful streaming queries, this delays state clean up as > well as clean-up based output. For example, consider a streaming aggregation > query with watermark-based state cleanup. The watermark is updated after > every batch with new data completes. The updated value is used in the next > batch to clean up state, and output finalized aggregates in append mode. > However, if there is no data, then the next batch does not occur, and > cleanup/output gets delayed unnecessarily. This is true for all stateful > streaming operators - aggregation, deduplication, joins, mapGroupsWithState > This issue tracks the work to enable no-data batches in MicroBatchExecution. > The major challenge is that all the tests of relevant stateful operations add > dummy data to force another batch for testing the state cleanup. So a lot of > the tests are going to be changed. So my plan is to enable no-data batches > for different stateful operators one at a time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38075) Hive script transform with order by and limit will return fake rows
[ https://issues.apache.org/jira/browse/SPARK-38075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-38075: - Fix Version/s: 3.1.4 (was: 3.1.3) Affects Version/s: 3.1.3 > Hive script transform with order by and limit will return fake rows > --- > > Key: SPARK-38075 > URL: https://issues.apache.org/jira/browse/SPARK-38075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > Fix For: 3.3.0, 3.1.4, 3.2.2 > > > For example: > {noformat} > create or replace temp view t as > select * from values > (1), > (2), > (3) > as t(a); > select transform(a) > USING 'cat' AS (a int) > FROM t order by a limit 10; > {noformat} > This returns: > {noformat} > NULL > NULL > NULL > 1 > 2 > 3 > {noformat} > Without {{order by}} and {{limit}}, the query returns: > {noformat} > 1 > 2 > 3 > {noformat} > Spark script transform does not have this issue. That is, if > {{spark.sql.catalogImplementation=in-memory}}, Spark does not return fake > rows. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485430#comment-17485430 ] Apache Spark commented on SPARK-38084: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35381 > Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py` > > > Key: SPARK-38084 > URL: https://issues.apache.org/jira/browse/SPARK-38084 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38084: Assignee: Apache Spark > Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py` > > > Key: SPARK-38084 > URL: https://issues.apache.org/jira/browse/SPARK-38084 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-38084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38084: Assignee: (was: Apache Spark) > Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py` > > > Key: SPARK-38084 > URL: https://issues.apache.org/jira/browse/SPARK-38084 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38084) Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
Dongjoon Hyun created SPARK-38084: - Summary: Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py` Key: SPARK-38084 URL: https://issues.apache.org/jira/browse/SPARK-38084 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.3.0, 3.2.2 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38047) Provide an option to only roll executors if they are outliers
[ https://issues.apache.org/jira/browse/SPARK-38047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38047. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35373 [https://github.com/apache/spark/pull/35373] > Provide an option to only roll executors if they are outliers > - > > Key: SPARK-38047 > URL: https://issues.apache.org/jira/browse/SPARK-38047 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Alex Holmes >Assignee: Alex Holmes >Priority: Major > Fix For: 3.3.0 > > > Currently executor rolling will always kill one executor every > {{{}spark.kubernetes.executor.rollInterval{}}}. For some of the policies this > may not be optimal in cases where the executor metric isn't an outlier > compared to other executors. There is a cost associated with killing > executors (ramp-up time for new executors for example) which applications may > not want to incur for non-outlier executors. > > This ticket would add the ability to only kill executors if they are > outliers. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27826) saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" format issue
[ https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485393#comment-17485393 ] Bandhu Gupta commented on SPARK-27826: -- Hi Fengtlyer, We are facing exactly the same issue that you reported here. I wanted to understand what we are missing to resolve this issue. Please let me know what you make of Hyukjin's comment on saveAsTable. > saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" > format issue > > > Key: SPARK-27826 > URL: https://issues.apache.org/jira/browse/SPARK-27826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.4.0 > Environment: CDH 5.13.1 - Spark version 2.2.0.cloudera2 > CDH 6.1.1 - Spark version 2.4.0-cdh6.1.1 >Reporter: fengtlyer >Priority: Minor > > Hi Spark Dev Team, > We tested a few times and found this bug can be reproduced in multiple Spark > versions > We tested in CDH 5.13.1 - Spark version 2.2.0.cloudera2 and CDH 6.1.1 - Spark > version 2.4.0-cdh6.1.1 > Both of them have this bug: > 1. If a table was created by Impala or Hive in HUE, then in Spark code, > "write.format("parquet").mode("append").saveAsTable()" will cause the format > issue (see the below error log) > 2. Hive/Impala in the HUE created table, then > "write.format("parquet").mode("overwrite").saveAsTable()", this code still > does not work. > 2.1 Hive/Impala in the HUE created table, and > "write.format("parquet").mode("overwrite").saveAsTable()", then > "write.format("parquet").mode("append").saveAsTable()" can work. > 3. Hive/Impala in the HUE created table, then "insertInto()" still will work. > 3.1 Hive/Impala in the HUE created a table, and used "insertInto()" to insert > some new records, then tried to use > "write.format("parquet").mode("append").saveAsTable()"; it will get the same > format error log > 4. 
Created parquet table and insert some data by Hive shell, then > "write.format("parquet").mode("append").saveAsTable()" can insert data, but > spark only shows data which insert by spark, and Hive only show data which > hive insert. > === > Error Log > === > {code} > spark.read.format("csv").option("sep",",").option("header","true").load("hdfs:///temp1/test_paquettest.csv").write.format("parquet").mode("append").saveAsTable("parquet_test_table") > {code} > {code} > org.apache.spark.sql.AnalysisException: The format of the existing table > default.parquet_test_table is `HiveFileFormat`. It doesn't match the > specified format `ParquetFileFormat`.; > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115) > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75) > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at 
scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50) > at >
[jira] [Updated] (SPARK-38083) set the amount of explained variance as parameter of pyspark.ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-38083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola updated SPARK-38083: --- Summary: set the amount of explained variance as parameter of pyspark.ml.feature.PCA (was: set the amout of explained variance as parameter of pyspark.ml.feature.PCA) > set the amount of explained variance as parameter of pyspark.ml.feature.PCA > --- > > Key: SPARK-38083 > URL: https://issues.apache.org/jira/browse/SPARK-38083 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Affects Versions: 3.2.2 >Reporter: Nicola >Priority: Major > > As in > [sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html], > where: > if {{0 < n_components < 1}} select the number of components such that the > amount of variance that needs to be explained is greater than the percentage > specified by n_components > it would be useful to have a similar behavior with the k parameter in > pyspark.ml.feature.PCA. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
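The requested sklearn-style behaviour can be emulated today by choosing k from the per-component explained-variance proportions (in PySpark, one could fit a PCA with k equal to the full dimensionality and read `PCAModel.explainedVariance`). The helper below is an illustrative sketch; the function name is made up for this example.

```python
def k_for_explained_variance(variances, threshold):
    """Smallest k such that the first k principal components explain at
    least `threshold` of the total variance, mirroring sklearn's
    0 < n_components < 1 behaviour. `variances` (eigenvalues or
    proportions) must be sorted in decreasing order, as PCA output is."""
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(variances, start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(variances)
```

For eigenvalues `[6.0, 3.0, 1.0]`, a 0.85 threshold selects k=2 (0.9 of the variance), while 0.95 requires all three components.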
[jira] [Updated] (SPARK-38083) set the amout of explained variance as parameter of pyspark.ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-38083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola updated SPARK-38083: --- Affects Version/s: 3.2.2 (was: 3.2.1) > set the amout of explained variance as parameter of pyspark.ml.feature.PCA > -- > > Key: SPARK-38083 > URL: https://issues.apache.org/jira/browse/SPARK-38083 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Affects Versions: 3.2.2 >Reporter: Nicola >Priority: Major > > As in > [sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html], > where: > if {{0 < n_components < 1}} select the number of components such that the > amount of variance that needs to be explained is greater than the percentage > specified by n_components > it would be useful to have a similar behavior with the k parameter in > pyspark.ml.feature.PCA. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38083) set the amout of explained variance as parameter of pyspark.ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-38083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola updated SPARK-38083: --- Component/s: ML > set the amout of explained variance as parameter of pyspark.ml.feature.PCA > -- > > Key: SPARK-38083 > URL: https://issues.apache.org/jira/browse/SPARK-38083 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Affects Versions: 3.2.1 >Reporter: Nicola >Priority: Major > > As in > [sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html], > where: > if {{0 < n_components < 1}} select the number of components such that the > amount of variance that needs to be explained is greater than the percentage > specified by n_components > it would be useful to have a similar behavior with the k parameter in > pyspark.ml.feature.PCA. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38083) set the amout of explained variance as parameter of pyspark.ml.feature.PCA
Nicola created SPARK-38083: -- Summary: set the amout of explained variance as parameter of pyspark.ml.feature.PCA Key: SPARK-38083 URL: https://issues.apache.org/jira/browse/SPARK-38083 Project: Spark Issue Type: Wish Components: MLlib Affects Versions: 3.2.1 Reporter: Nicola As in [sklearn.decomposition.PCA|https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html], where: if {{0 < n_components < 1}} select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components it would be useful to have a similar behavior with the k parameter in pyspark.ml.feature.PCA. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38081) Support cloud-backend in K8s IT with SBT
[ https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38081. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35376 [https://github.com/apache/spark/pull/35376] > Support cloud-backend in K8s IT with SBT > > > Key: SPARK-38081 > URL: https://issues.apache.org/jira/browse/SPARK-38081 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38081) Support cloud-backend in K8s IT with SBT
[ https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38081: - Assignee: Dongjoon Hyun > Support cloud-backend in K8s IT with SBT > > > Key: SPARK-38081 > URL: https://issues.apache.org/jira/browse/SPARK-38081 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-38082: --- Description: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released almost 9 years ago and since then some methods that we use have been deprecated in favor of new additions and anew API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to one of the following - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. was: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released almost 9 years ago and since then some methods that we use have been deprecated in favor of new additions and anew API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. 
> Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released almost 9 years ago and since then some methods > that we use have been deprecated in favor of new additions and anew API > ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to one of the following > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
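[Editor's sketch] The `tostring` to `tobytes` migration behind the `>=1.9.0` bound above is mechanical: `ndarray.tobytes` was added in numpy 1.9 as the replacement name for `ndarray.tostring` (later formally deprecated), and both return the same raw buffer. A quick illustration:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int32)

# Old spelling (deprecated alias):
#   raw = arr.tostring()
# New spelling, byte-for-byte identical output:
raw = arr.tobytes()

# Round-trip to confirm the buffer is intact
restored = np.frombuffer(raw, dtype=np.int32)
print(restored.tolist())  # [1, 2, 3]
```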
[jira] [Updated] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-38082: --- Description: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. was: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released over almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. 
> Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released almost 9 years old and since then, some > methods that we use have been deprecated in favor of new additions, and new > API ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to: > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-38082: --- Description: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released almost 9 years ago and since then some methods that we use have been deprecated in favor of new additions and anew API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. was: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released almost 9 years ago and since then some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. > Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. 
> However, 1.7 has been released almost 9 years ago and since then some methods > that we use have been deprecated in favor of new additions and anew API > ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to: > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-38082: --- Description: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released almost 9 years ago and since then some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. was: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. 
> Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released almost 9 years ago and since then some methods > that we use have been deprecated in favor of new additions, and new API > ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to: > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485327#comment-17485327 ] Maciej Szymkiewicz commented on SPARK-38082: cc [~hyukjin.kwon] [~WeichenXu123] [~huaxingao] > Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released over almost 9 years old and since then, some > methods that we use have been deprecated in favor of new additions, and new > API ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to: > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-38082: --- Description: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released over almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. was: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released over almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 should require much discussion. 
> Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released over almost 9 years old and since then, some > methods that we use have been deprecated in favor of new additions, and new > API ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to: > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38082) Update minimum numpy version
[ https://issues.apache.org/jira/browse/SPARK-38082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-38082: --- Description: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released over almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. was: Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released over almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 shouldn't require much discussion. 
> Update minimum numpy version > > > Key: SPARK-38082 > URL: https://issues.apache.org/jira/browse/SPARK-38082 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. > However, 1.7 has been released over almost 9 years old and since then, some > methods that we use have been deprecated in favor of new additions, and new > API ({{numpy.typing}}, that is of some interest to us, has been added. > We should update minimum version requirement to: > - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to > replace deprecated {{tostring}} calls with {{tobytes}}. > - {{>=1.15.0}} (released 2018-07-23) ‒ this is reasonable bound to match our > minimum supported pandas version. > - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. > The last one might be somewhat controversial, but 1.15 shouldn't require much > discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38082) Update minimum numpy version
Maciej Szymkiewicz created SPARK-38082: -- Summary: Update minimum numpy version Key: SPARK-38082 URL: https://issues.apache.org/jira/browse/SPARK-38082 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz Currently, we use set numpy version in {{extras_require}} to be {{>=1.7}}. However, 1.7 has been released over almost 9 years old and since then, some methods that we use have been deprecated in favor of new additions, and new API ({{numpy.typing}}, that is of some interest to us, has been added. We should update minimum version requirement to: - {{>=1.9.0}} ‒ this is minimum reasonable bound, that will allow us to replace deprecated {{tostring}} calls with {{tobytes}}. - {{>=1.15}} (released 2018-07-23) ‒ this is reasonable bound to match our minimum supported pandas version. - {{>=1.20.0}} (released 2021-01-30) ‒ to fully utilize numpy typing. The last one might be somewhat controversial, but 1.15 should require much discussion. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485313#comment-17485313 ] Apache Spark commented on SPARK-37417: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35380 > Inline type hints for python/pyspark/ml/linalg/__init__.py > -- > > Key: SPARK-37417 > URL: https://issues.apache.org/jira/browse/SPARK-37417 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/linalg/__init__.pyi to > python/pyspark/ml/linalg/__init__.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
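[Editor's sketch] "Inlining" in the sub-task above means moving signatures from the parallel `.pyi` stub into the `.py` source itself, after which the stub can be deleted and type checkers read the implementation directly. A toy before/after (the function is invented for illustration; it is not from pyspark.ml.linalg):

```python
from typing import List

# Before: the annotation lived only in a separate __init__.pyi stub:
#     def norm(values: List[float]) -> float: ...
# After: the hints are inline in the implementation file, so no stub
# is needed and the hints cannot drift out of sync with the code.
def norm(values: List[float]) -> float:
    """Euclidean norm of a plain list of floats."""
    return sum(v * v for v in values) ** 0.5

print(norm([3.0, 4.0]))  # 5.0
```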
[jira] [Assigned] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37417: Assignee: Apache Spark > Inline type hints for python/pyspark/ml/linalg/__init__.py > -- > > Key: SPARK-37417 > URL: https://issues.apache.org/jira/browse/SPARK-37417 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Inline type hints from python/pyspark/ml/linalg/__init__.pyi to > python/pyspark/ml/linalg/__init__.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37417: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/ml/linalg/__init__.py > -- > > Key: SPARK-37417 > URL: https://issues.apache.org/jira/browse/SPARK-37417 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/linalg/__init__.pyi to > python/pyspark/ml/linalg/__init__.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37417) Inline type hints for python/pyspark/ml/linalg/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485312#comment-17485312 ] Apache Spark commented on SPARK-37417: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35380 > Inline type hints for python/pyspark/ml/linalg/__init__.py > -- > > Key: SPARK-37417 > URL: https://issues.apache.org/jira/browse/SPARK-37417 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/linalg/__init__.pyi to > python/pyspark/ml/linalg/__init__.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485215#comment-17485215 ] Cheng Lian commented on SPARK-37980: [~prakharjain09], as you've mentioned, it's not super straightforward to customize the Parquet code paths in Spark to achieve the goal. In the meanwhile, this functionality is in general quite useful. I can imagine it enabling other systems in the Parquet ecosystem to build more sophisticated indexing solutions. Instead of doing heavy customizations in Spark, would it be better if we can make the changes happen in upstream {{parquet-mr}} so that other systems can benefit from it more easily? > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3 >Reporter: Prakhar Jain >Priority: Major > > Spark recently added hidden metadata column support for File based > datasources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION also. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row Indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies row in a table. This information can be used to mark rows > e.g. this can be used by indexer etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
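[Editor's sketch] The (fileName, rowIndex) identity described above can be illustrated without Spark: number rows from 0 in the order they appear in each file, and the pair uniquely keys every row across a multi-file table. A small stand-alone sketch (file names and data are invented):

```python
import os
import tempfile

# Two "table files", as a file-based source might lay them out
tmp = tempfile.mkdtemp()
files = {"part-0.txt": ["a", "b"], "part-1.txt": ["c"]}
for name, rows in files.items():
    with open(os.path.join(tmp, name), "w") as f:
        f.write("\n".join(rows) + "\n")

# Build (file_name, row_index) -> row, mirroring what a ROW_INDEX
# metadata column would expose alongside each row of each file
keyed = {}
for name in sorted(files):
    with open(os.path.join(tmp, name)) as f:
        for idx, line in enumerate(f):
            keyed[(name, idx)] = line.strip()

print(keyed[("part-0.txt", 1)])  # b
```

An indexer can store just these keys and later fetch the exact rows they point at, which is the use case the issue describes.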
[jira] [Assigned] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json
[ https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-38067: -- Assignee: Bjørn Jørgensen > Inconsistent missing values handling in Pandas on Spark to_json > --- > > Key: SPARK-38067 > URL: https://issues.apache.org/jira/browse/SPARK-38067 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > If {{ps.DataFrame.to_json}} is called without {{path}} argument, missing > values are written explicitly > {code:python} > import pandas as pd > import pyspark.pandas as ps > pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]}) > psf = ps.from_pandas(pdf) > psf.to_json() > ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]' > {code:python} > This behavior is consistent with Pandas: > {code:python} > pdf.to_json() > ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}' > {code} > However, if {{path}} is provided, missing values are omitted by default: > {code:python} > import tempfile > path = tempfile.mktemp() > psf.to_json(path) > spark.read.text(path).show() > ## ++ > ## | value| > ## ++ > ## |{"id":2,"value":3.0}| > ## |{"id":3}| > ## |{"id":1}| > ## ++ > {code} > We should set {{ignoreNullFields}} for Pandas API, to be `False` by default, > so both cases handle missing values in the same way. > {code:python} > psf.to_json(path, ignoreNullFields=False) > spark.read.text(path).show(truncate=False) > ## +-+ > ## |value| > ## +-+ > ## |{"id":3,"value":null}| > ## |{"id":1,"value":null}| > ## |{"id":2,"value":3.0} | > ## +-+ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json
[ https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz resolved SPARK-38067.
----------------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 35296
[https://github.com/apache/spark/pull/35296]

> Inconsistent missing values handling in Pandas on Spark to_json
> ---------------------------------------------------------------
>
>                 Key: SPARK-38067
>                 URL: https://issues.apache.org/jira/browse/SPARK-38067
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Bjørn Jørgensen
>            Assignee: Bjørn Jørgensen
>            Priority: Major
>             Fix For: 3.3.0
>
> If {{ps.DataFrame.to_json}} is called without the {{path}} argument, missing values are written explicitly:
> {code:python}
> import pandas as pd
> import pyspark.pandas as ps
>
> pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
> psf = ps.from_pandas(pdf)
> psf.to_json()
> ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
> {code}
> This behavior is consistent with Pandas:
> {code:python}
> pdf.to_json()
> ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
> {code}
> However, if {{path}} is provided, missing values are omitted by default:
> {code:python}
> import tempfile
>
> path = tempfile.mktemp()
> psf.to_json(path)
> spark.read.text(path).show()
> ## +--------------------+
> ## |               value|
> ## +--------------------+
> ## |{"id":2,"value":3.0}|
> ## |            {"id":3}|
> ## |            {"id":1}|
> ## +--------------------+
> {code}
> We should set {{ignoreNullFields}} to {{False}} by default in the Pandas API, so that both cases handle missing values in the same way:
> {code:python}
> psf.to_json(path, ignoreNullFields=False)
> spark.read.text(path).show(truncate=False)
> ## +---------------------+
> ## |value                |
> ## +---------------------+
> ## |{"id":3,"value":null}|
> ## |{"id":1,"value":null}|
> ## |{"id":2,"value":3.0} |
> ## +---------------------+
> {code}
[jira] [Updated] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json
[ https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz updated SPARK-38067:
---------------------------------------
    Description: 
If {{ps.DataFrame.to_json}} is called without the {{path}} argument, missing values are written explicitly:

{code:python}
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
psf = ps.from_pandas(pdf)
psf.to_json()
## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
{code}

This behavior is consistent with Pandas:

{code:python}
pdf.to_json()
## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
{code}

However, if {{path}} is provided, missing values are omitted by default:

{code:python}
import tempfile

path = tempfile.mktemp()
psf.to_json(path)
spark.read.text(path).show()
## +--------------------+
## |               value|
## +--------------------+
## |{"id":2,"value":3.0}|
## |            {"id":3}|
## |            {"id":1}|
## +--------------------+
{code}

We should set {{ignoreNullFields}} to {{False}} by default in the Pandas API, so that both cases handle missing values in the same way:

{code:python}
psf.to_json(path, ignoreNullFields=False)
spark.read.text(path).show(truncate=False)
## +---------------------+
## |value                |
## +---------------------+
## |{"id":3,"value":null}|
## |{"id":1,"value":null}|
## |{"id":2,"value":3.0} |
## +---------------------+
{code}

  was:
With pandas:

{code:java}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_pd = pd.DataFrame.from_dict(data)
test_pd.shape
{code}
(4, 2)

{code:java}
test_pd.to_json("testpd.json")
test_pd2 = pd.read_json("testpd.json")
test_pd2.shape
{code}
(4, 2)

The Pandas-on-Spark API deletes columns whose values are all null:

{code:java}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_ps = ps.DataFrame.from_dict(data)
test_ps.shape
{code}
(4, 2)

{code:java}
test_ps.to_json("testps.json")
test_ps2 = ps.read_json("testps.json/*")
test_ps2.shape
{code}
(4, 1)

We need to change this so that the Pandas-on-Spark API behaves more like pandas.
I have opened a PR for this.

> Inconsistent missing values handling in Pandas on Spark to_json
> ---------------------------------------------------------------
>
>                 Key: SPARK-38067
>                 URL: https://issues.apache.org/jira/browse/SPARK-38067
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>
> If {{ps.DataFrame.to_json}} is called without the {{path}} argument, missing values are written explicitly:
> {code:python}
> import pandas as pd
> import pyspark.pandas as ps
>
> pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
> psf = ps.from_pandas(pdf)
> psf.to_json()
> ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
> {code}
> This behavior is consistent with Pandas:
> {code:python}
> pdf.to_json()
> ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
> {code}
> However, if {{path}} is provided, missing values are omitted by default:
> {code:python}
> import tempfile
>
> path = tempfile.mktemp()
> psf.to_json(path)
> spark.read.text(path).show()
> ## +--------------------+
> ## |               value|
> ## +--------------------+
> ## |{"id":2,"value":3.0}|
> ## |            {"id":3}|
> ## |            {"id":1}|
> ## +--------------------+
> {code}
> We should set {{ignoreNullFields}} to {{False}} by default in the Pandas API, so that both cases handle missing values in the same way:
> {code:python}
> psf.to_json(path, ignoreNullFields=False)
> spark.read.text(path).show(truncate=False)
> ## +---------------------+
> ## |value                |
> ## +---------------------+
> ## |{"id":3,"value":null}|
> ## |{"id":1,"value":null}|
> ## |{"id":2,"value":3.0} |
> ## +---------------------+
> {code}
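For readers without a Spark session at hand, the effect of the {{ignoreNullFields}} writer option described above can be mimicked with the standard library alone. The helper name below is made up for illustration; it is not part of the Spark or pandas API.

```python
import json

rows = [{"id": 1, "value": None}, {"id": 2, "value": 3.0}, {"id": 3, "value": None}]

def to_json_line(row, ignore_null_fields):
    # Hypothetical helper mirroring the writer option: when
    # ignore_null_fields is true, keys with null values are dropped
    # before serializing, so the null never reaches the output.
    if ignore_null_fields:
        row = {k: v for k, v in row.items() if v is not None}
    return json.dumps(row, separators=(",", ":"))

# Default Spark writer behavior (nulls omitted):
assert to_json_line(rows[0], ignore_null_fields=True) == '{"id":1}'
# Proposed default for the Pandas API (nulls written explicitly):
assert to_json_line(rows[0], ignore_null_fields=False) == '{"id":1,"value":null}'
```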
[jira] [Updated] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json
[ https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz updated SPARK-38067:
---------------------------------------
    Summary: Inconsistent missing values handling in Pandas on Spark to_json  (was: Pandas on spark deletes columns with all None as default.)

> Inconsistent missing values handling in Pandas on Spark to_json
> ---------------------------------------------------------------
>
>                 Key: SPARK-38067
>                 URL: https://issues.apache.org/jira/browse/SPARK-38067
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>
> With pandas:
> {code:java}
> data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
> test_pd = pd.DataFrame.from_dict(data)
> test_pd.shape
> {code}
> (4, 2)
> {code:java}
> test_pd.to_json("testpd.json")
> test_pd2 = pd.read_json("testpd.json")
> test_pd2.shape
> {code}
> (4, 2)
> The Pandas-on-Spark API deletes columns whose values are all null:
> {code:java}
> data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
> test_ps = ps.DataFrame.from_dict(data)
> test_ps.shape
> {code}
> (4, 2)
> {code:java}
> test_ps.to_json("testps.json")
> test_ps2 = ps.read_json("testps.json/*")
> test_ps2.shape
> {code}
> (4, 1)
> We need to change this so that the Pandas-on-Spark API behaves more like pandas.
> I have opened a PR for this.
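The shape change reported above, (4, 2) on write becoming (4, 1) on read, follows directly from null fields being dropped at write time: if a column is null in every row, no written record ever mentions it, so a reader cannot recover it. A stdlib-only sketch of that round trip, with no Spark or pandas involved:

```python
import json

# A column that is null in every row.
data = [{"col_1": v, "col_2": None} for v in [3, 2, 1, 0]]

# Write as JSON lines, dropping null fields (the writer's default behavior).
written = [
    json.dumps({k: v for k, v in row.items() if v is not None})
    for row in data
]

# Read back and collect the observed column names: "col_2" is gone.
read_back = [json.loads(line) for line in written]
columns = sorted({k for row in read_back for k in row})
assert columns == ["col_1"]  # shape went from (4, 2) to (4, 1)
```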