[jira] [Created] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
Ye Zhou created SPARK-37023: --- Summary: Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry Key: SPARK-37023 URL: https://issues.apache.org/jira/browse/SPARK-37023 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.2.0 Reporter: Ye Zhou The assertion below in MapOutputTracker.getMapSizesByExecutorId is not guaranteed to hold: {code:java} assert(mapSizesByExecutorId.enableBatchFetch == true){code} The reason is that in some stage retry cases, shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses still exist because the Driver has already collected the merged statuses for this shuffle dependency. In that case, the current implementation sets enableBatchFetch to false simply because merge statuses are present. Details can be found here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] We should improve the implementation here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
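As a purely illustrative model of the intended behavior (not Spark source code; the names below are made up for the sketch), the decision could look like this: batch fetch should stay enabled when the retried shuffle dependency has push-based merge disabled, even if the driver still holds merge statuses collected earlier.

{code:python}
# Illustrative sketch only -- not the actual MapOutputTracker logic.
def should_enable_batch_fetch(shuffle_merge_enabled: bool,
                              has_merge_statuses: bool) -> bool:
    if not shuffle_merge_enabled:
        # The stage retry disabled push-based merge for this dependency,
        # so any previously collected merge statuses should be ignored
        # and batch fetch should remain enabled.
        return True
    # Merge is enabled: batch fetch is disabled when merge statuses are used.
    return not has_merge_statuses


# The retry case described above: merge disabled, stale statuses present.
assert should_enable_batch_fetch(False, True) is True
assert should_enable_batch_fetch(True, True) is False
{code}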
[jira] [Commented] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429523#comment-17429523 ] Hyukjin Kwon commented on SPARK-37004: -- Just for other people who face this issue from Spark 3.2.0 onwards: the workaround is to set the {{PYSPARK_PIN_THREAD}} environment variable to {{false}}. > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default (SPARK-35303). > However, in a jupyter notebook, after I cancel (interrupt) a long-running > Spark job, the next Spark command will fail with some py4j errors. See > attached notebook for repro. > Cannot reproduce the issue after I turn off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
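For readers hitting this on 3.2.0, a minimal sketch of applying the workaround in a notebook follows. The environment variable has to be set before the Py4J gateway (and thus the SparkSession) is created; the master and app name below are placeholders, adjust them for your environment.

{code:python}
import os

# Must run before any PySpark code starts the JVM gateway, otherwise the
# pinned-thread setting is not picked up.
os.environ["PYSPARK_PIN_THREAD"] = "false"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                # placeholder for illustration
    .appName("pin-thread-workaround")  # placeholder for illustration
    .getOrCreate()
)
{code}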
[jira] [Commented] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429519#comment-17429519 ] Apache Spark commented on SPARK-36232: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34299 > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
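Until this is fixed, one possible interim workaround (an assumption, not an official recommendation) is to normalize {{Decimal('NaN')}} values to {{None}} on the Python side before building the Series, since {{None}} round-trips without the pickle error:

{code:python}
import decimal as d

import pyspark.pandas as ps

data = [d.Decimal("1.0"), d.Decimal("2.0"), d.Decimal("NaN")]

# Decimal exposes is_nan() for exactly this kind of check; replace NaN
# decimals with None before handing the data to pandas-on-Spark.
cleaned = [None if x.is_nan() else x for x in data]

psser = ps.Series(cleaned)
print(psser)
{code}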
[jira] [Commented] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429518#comment-17429518 ] Apache Spark commented on SPARK-36232: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34299 > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36230: Assignee: (was: Apache Spark) > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
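For context, plain pandas treats {{Decimal('NaN')}} as a missing value, which is the behavior pandas-on-Spark is expected to match here (a quick check with pandas; exact output can vary slightly across pandas versions):

{code:python}
from decimal import Decimal

import pandas as pd

pser = pd.Series([Decimal("0.1"), Decimal("NaN")])

# pandas recognizes Decimal('NaN') as missing, so hasnans is True;
# pandas-on-Spark currently converts the value to None and reports False.
print(pd.isna(Decimal("NaN")))  # True
print(pser.hasnans)             # True (expected behavior)
{code}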
[jira] [Commented] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429517#comment-17429517 ] Apache Spark commented on SPARK-36230: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34299 > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36232: Assignee: (was: Apache Spark) > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36230: Assignee: Apache Spark > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36232: Assignee: Apache Spark > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36230: Assignee: (was: Apache Spark) > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429515#comment-17429515 ] Yikun Jiang commented on SPARK-36231: - working on this > Support arithmetic operations of Series containing Decimal(np.nan) > --- > > Key: SPARK-36231 > URL: https://issues.apache.org/jira/browse/SPARK-36231 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Arithmetic operations of Series containing Decimal(np.nan) raise > java.lang.NullPointerException in driver. An example is shown as below: > {code:java} > >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), > >>> decimal.Decimal(np.nan)]) > >>> psser = ps.from_pandas(pser) > >>> pser + 1 > 0 2 > 1 3 > 2 NaN > >>> psser + 1 > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629) > at > 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107) > at > org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.s
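A possible interim workaround (an assumption, not a verified fix) is to map {{Decimal('NaN')}} to {{None}} on the pandas side before calling {{from_pandas}}, so the arithmetic never sees the NaN decimal that triggers the NullPointerException:

{code:python}
import decimal

import numpy as np
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series(
    [decimal.Decimal(1.0), decimal.Decimal(2.0), decimal.Decimal(np.nan)]
)

# Normalize Decimal('NaN') to None before handing the data to pandas-on-Spark.
cleaned = pser.map(
    lambda x: None if isinstance(x, decimal.Decimal) and x.is_nan() else x
)

psser = ps.from_pandas(cleaned)
# With the NaN decimal removed, the operation may follow the ordinary
# missing-value path instead of raising in the JVM.
print(psser + 1)
{code}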
[jira] [Commented] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429514#comment-17429514 ] Yikun Jiang commented on SPARK-36230: - working on this > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC
[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34960: Assignee: Apache Spark > Aggregate (Min/Max/Count) push down for ORC > --- > > Key: SPARK-34960 > URL: https://issues.apache.org/jira/browse/SPARK-34960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Minor > > Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we > can also push down certain aggregations into ORC. ORC exposes column > statistics in interface `org.apache.orc.Reader` > ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118] > ), where Spark can utilize for aggregation push down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
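The queries that benefit are ungrouped MIN/MAX/COUNT aggregates, which ORC file- and stripe-level column statistics can answer without scanning row data. Below is a hedged sketch of how this might be exercised from PySpark once the feature lands; the config name (mirroring the Parquet one), the path and the column name are assumptions, not confirmed API:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumption: an opt-in flag analogous to the Parquet aggregate push down
# config; check the final PR for the actual option name.
spark.conf.set("spark.sql.orc.aggregatePushdown", "true")

df = spark.read.orc("/path/to/orc/table")  # hypothetical path

# Min/Max/Count without GROUP BY are the aggregates that can be served
# from ORC column statistics.
df.agg(F.min("price"), F.max("price"), F.count("price")).show()
{code}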
[jira] [Assigned] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC
[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34960: Assignee: (was: Apache Spark) > Aggregate (Min/Max/Count) push down for ORC > --- > > Key: SPARK-34960 > URL: https://issues.apache.org/jira/browse/SPARK-34960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we > can also push down certain aggregations into ORC. ORC exposes column > statistics in interface `org.apache.orc.Reader` > ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118] > ), where Spark can utilize for aggregation push down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC
[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429513#comment-17429513 ] Apache Spark commented on SPARK-34960: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/34298 > Aggregate (Min/Max/Count) push down for ORC > --- > > Key: SPARK-34960 > URL: https://issues.apache.org/jira/browse/SPARK-34960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we > can also push down certain aggregations into ORC. ORC exposes column > statistics in interface `org.apache.orc.Reader` > ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118] > ), where Spark can utilize for aggregation push down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37017: - Issue Type: Bug (was: Improvement) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
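As a general illustration of the pattern (a Python sketch, not the actual Scala CatalogManager code): hold the lock only while reading or writing the shared state, and move any call that may itself take other locks outside the critical section.

{code:python}
import threading

_lock = threading.Lock()
_current_namespace = None


def current_namespace(load_default):
    """Return the cached namespace, computing it lazily outside the lock."""
    global _current_namespace
    # Narrow critical section: only the shared-state read is guarded.
    with _lock:
        cached = _current_namespace
    if cached is not None:
        return cached
    # The potentially slow, lock-taking call runs outside the lock,
    # which avoids the lock-ordering deadlock described above.
    default = load_default()
    with _lock:
        if _current_namespace is None:
            _current_namespace = default
        return _current_namespace
{code}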
[jira] [Assigned] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37022: Assignee: Apache Spark > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37022: Assignee: (was: Apache Spark) > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429491#comment-17429491 ] Apache Spark commented on SPARK-37022: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34297 > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429482#comment-17429482 ] Maciej Szymkiewicz commented on SPARK-37022: cc [~hyukjin.kwon] > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429477#comment-17429477 ] Maciej Szymkiewicz edited comment on SPARK-37022 at 10/15/21, 8:45 PM: --- The attached files show git diff stats for a given configuration. I'll open a draft PR soon, to better visualize the extent of required changes (https://github.com/apache/spark/pull/34297) was (Author: zero323): The attached files show git diff stats for a given configuration. I'll open a draft PR soon, to better visualize the extent of required changes. > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: black-diff-stats.txt > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: (was: black-diff-stats.txt) > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429477#comment-17429477 ] Maciej Szymkiewicz commented on SPARK-37022: The attached files show git diff stats for a given configuration. I'll open a draft PR soon, to better visualize the extent of required changes. > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: black-diff-stats.txt > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: pyproject.toml > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
Maciej Szymkiewicz created SPARK-37022: -- Summary: Use black as a formatter for the whole PySpark codebase. Key: SPARK-37022 URL: https://issues.apache.org/jira/browse/SPARK-37022 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz [{{black}}|https://github.com/psf/black] is a popular Python code formatter. It is used by a number of projects, both small and large, including prominent ones such as pandas, scikit-learn, Django and SQLAlchemy. Black is already used to format the {{pyspark.pandas}} module and (though not enforced) the stub files. We should consider using black to enforce formatting of all PySpark files. There are multiple reasons to do that: - Consistency: black is already used across the existing codebase, and black-formatted chunks of code are already being added to modules other than pyspark.pandas as a result of type hint inlining (SPARK-36845). - Lower cost of contributing and reviewing: formatting can be automatically enforced and applied. - Simpler reviews: in general, black-formatted code produces small and highly readable diffs. Risks: - Initial reformatting requires quite significant changes. - Applying black will break blame in the GitHub UI (for git in general see [Avoiding ruining git blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). Additional steps: - To simplify backporting, black will have to be applied to all active branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
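To give a feel for the kind of diffs black produces (an illustrative snippet with made-up names, not code from the PySpark tree), black mostly normalizes spacing and quoting without changing behavior:

{code:python}
# Before black (hand-formatted):
def infer_schema( rows,names ):
    if len(rows)==0:
        raise ValueError( 'rows must not be empty' )
    return dict(zip(names,rows[0]))


# After black: normalized spacing around commas, operators and parentheses,
# and double quotes for string literals.
def infer_schema(rows, names):
    if len(rows) == 0:
        raise ValueError("rows must not be empty")
    return dict(zip(names, rows[0]))
{code}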
[jira] [Resolved] (SPARK-36910) Inline type hints for python/pyspark/sql/types.py
[ https://issues.apache.org/jira/browse/SPARK-36910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36910. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34174 https://github.com/apache/spark/pull/34174 > Inline type hints for python/pyspark/sql/types.py > - > > Key: SPARK-36910 > URL: https://issues.apache.org/jira/browse/SPARK-36910 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/types.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
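For readers unfamiliar with the effort: "inlining" means moving annotations that previously lived in separate {{.pyi}} stub files into the {{.py}} modules themselves. A simplified re-creation (not the exact upstream code) of what that looks like for a data type class:

{code:python}
# Previously, the annotation lived only in a stub file (types.pyi):
#     class DataType:
#         def needConversion(self) -> bool: ...
#
# After inlining, the annotation sits directly on the implementation, so the
# signature and the code are type-checked together and cannot drift apart.
from typing import Any


class DataType:
    def needConversion(self) -> bool:
        return False

    def toInternal(self, obj: Any) -> Any:
        return obj
{code}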
[jira] [Resolved] (SPARK-36991) Inline type hints for spark/python/pyspark/sql/streaming.py
[ https://issues.apache.org/jira/browse/SPARK-36991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36991. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34277 https://github.com/apache/spark/pull/34277 > Inline type hints for spark/python/pyspark/sql/streaming.py > --- > > Key: SPARK-36991 > URL: https://issues.apache.org/jira/browse/SPARK-36991 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for spark/python/pyspark/sql/streaming.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37020: Assignee: (was: Apache Spark) > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429396#comment-17429396 ] Apache Spark commented on SPARK-37020: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34291 > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429394#comment-17429394 ] Apache Spark commented on SPARK-36989: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34296 > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
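To illustrate what such data tests check (a schematic example; the real mypy data-driven test format and its output matchers look different), the idea is to assert both the types mypy infers and the errors it reports for small snippets:

{code:python}
# Intended to be run through mypy; the comments show the expected output a
# data test would match against.
from typing import reveal_type  # Python 3.11+; use typing_extensions on older versions

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# A data test asserts the type mypy infers for an expression...
reveal_type(F.col("id"))  # expected note: Revealed type is "pyspark.sql.column.Column"

# ...and that clearly wrong code is rejected, e.g. (kept as a comment so the
# snippet still runs):
# df.withColumn("x", 1)   # expected error: argument has incompatible type "int"
{code}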
[jira] [Assigned] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37020: Assignee: Apache Spark > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36989: Assignee: Apache Spark > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36989: Assignee: (was: Apache Spark) > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429395#comment-17429395 ] Apache Spark commented on SPARK-37020: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34291 > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37021) JDBC option "sessionInitStatement" does not execute set sql statement when resolving a table
[ https://issues.apache.org/jira/browse/SPARK-37021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valery Meleshkin updated SPARK-37021: - Description: If {{sessionInitStatement}} is required to grant permissions or resolve an ambiguity, schema resolution will fail when reading a JDBC table. Consider the following example running against Oracle database: {code:scala} reader.format("jdbc").options( Map( "url" -> jdbcUrl, "dbtable" -> "SELECT * FROM FOO", "user" -> "BOB", "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR, "password" -> password )).load {code} Table {{FOO}} is in schema {{BAR}}, but default value for {{CURRENT_SCHEMA}} for the JDBC connection will be {{BOB}}. Therefore, the code above will fail with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It happens because [resolveTable |https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67] that is called during planning phase ignores {{sessionInitStatement}}. was: If {{sessionInitStatement}} is required to grant permissions or resolve an ambiguity, schema resolution will fail when reading a JDBC table. Consider the following example running against Oracle database: {code:scala} reader.format("jdbc").options( Map( "url" -> jdbcUrl, "dbtable" -> "SELECT * FROM FOO", "user" -> "BOB", "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR, "password" -> password )).load {code} Table {{FOO}} is in schema {{BAR}}, but default value for {{CURRENT_SCHEMA}} for the JDBC connection will be {{BOB}}. Therefore, the code above will fail with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It happens because [resolveTable |resolveTable] that is called during planning phase ignores `sessionInitStatement`. > JDBC option "sessionInitStatement" does not execute set sql statement when > resolving a table > > > Key: SPARK-37021 > URL: https://issues.apache.org/jira/browse/SPARK-37021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.2 >Reporter: Valery Meleshkin >Priority: Major > > If {{sessionInitStatement}} is required to grant permissions or resolve an > ambiguity, schema resolution will fail when reading a JDBC table. > Consider the following example running against Oracle database: > {code:scala} > reader.format("jdbc").options( > Map( > "url" -> jdbcUrl, > "dbtable" -> "SELECT * FROM FOO", > "user" -> "BOB", > "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR, > "password" -> password > )).load > {code} > Table {{FOO}} is in schema {{BAR}}, but default value for {{CURRENT_SCHEMA}} > for the JDBC connection will be {{BOB}}. Therefore, the code above will fail > with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). > It happens because [resolveTable > |https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67] > that is called during planning phase ignores {{sessionInitStatement}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
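The same options expressed through the Python API hit the same problem. Until {{sessionInitStatement}} is honored during schema resolution, one workaround sketch (an assumption about the reporter's setup, not a verified fix) is to qualify the table with its schema so that resolution does not depend on {{CURRENT_SCHEMA}}:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders for the reporter's Oracle connection details.
jdbc_url = "jdbc:oracle:thin:@//host:1521/service"
password = "..."

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("user", "BOB")
    .option("password", password)
    # Qualifying the table avoids relying on the ALTER SESSION statement
    # while resolveTable still ignores sessionInitStatement.
    .option("dbtable", "BAR.FOO")
    .load()
)
{code}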
[jira] [Created] (SPARK-37021) JDBC option "sessionInitStatement" does not execute set sql statement when resolving a table
Valery Meleshkin created SPARK-37021: Summary: JDBC option "sessionInitStatement" does not execute set sql statement when resolving a table Key: SPARK-37021 URL: https://issues.apache.org/jira/browse/SPARK-37021 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.2 Reporter: Valery Meleshkin If {{sessionInitStatement}} is required to grant permissions or resolve an ambiguity, schema resolution will fail when reading a JDBC table. Consider the following example running against an Oracle database: {code:scala} reader.format("jdbc").options( Map( "url" -> jdbcUrl, "dbtable" -> "SELECT * FROM FOO", "user" -> "BOB", "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR"""", "password" -> password )).load {code} Table {{FOO}} is in schema {{BAR}}, but the default value of {{CURRENT_SCHEMA}} for the JDBC connection will be {{BOB}}. Therefore, the code above will fail with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It happens because [resolveTable|https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67], which is called during the planning phase, ignores {{sessionInitStatement}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
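For context, the failure mode above is that Spark asks the database for the query's schema before the session init statement has ever run. The sketch below only illustrates the expected ordering; {{resolveColumnsWithInit}} is a hypothetical helper, not the actual {{JDBCRDD.resolveTable}} implementation, and the real fix may look quite different.

{code:scala}
import java.sql.Connection

// Hypothetical sketch: execute the user-supplied sessionInitStatement on the
// connection *before* resolving the schema, so that statements such as
// ALTER SESSION SET CURRENT_SCHEMA also take effect during planning.
def resolveColumnsWithInit(
    conn: Connection,
    options: Map[String, String],
    query: String): Seq[String] = {
  options.get("sessionInitStatement").foreach { initSql =>
    val stmt = conn.createStatement()
    try stmt.execute(initSql) finally stmt.close()
  }
  // "WHERE 1=0" returns no rows but still exposes the result metadata.
  val ps = conn.prepareStatement(s"SELECT * FROM ($query) t WHERE 1=0")
  try {
    val md = ps.executeQuery().getMetaData
    (1 to md.getColumnCount).map(md.getColumnName)
  } finally {
    ps.close()
  }
}
{code}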
[jira] [Created] (SPARK-37020) Limit push down in DS V2
Huaxin Gao created SPARK-37020: -- Summary: Limit push down in DS V2 Key: SPARK-37020 URL: https://issues.apache.org/jira/browse/SPARK-37020 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Huaxin Gao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36276) Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43
[ https://issues.apache.org/jira/browse/SPARK-36276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429343#comment-17429343 ] Apache Spark commented on SPARK-36276: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34295 > Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43 > -- > > Key: SPARK-36276 > URL: https://issues.apache.org/jira/browse/SPARK-36276 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36276) Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43
[ https://issues.apache.org/jira/browse/SPARK-36276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429341#comment-17429341 ] Apache Spark commented on SPARK-36276: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34295 > Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43 > -- > > Key: SPARK-36276 > URL: https://issues.apache.org/jira/browse/SPARK-36276 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35926) Support YearMonthIntervalType in width-bucket function
[ https://issues.apache.org/jira/browse/SPARK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35926. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33132 [https://github.com/apache/spark/pull/33132] > Support YearMonthIntervalType in width-bucket function > -- > > Key: SPARK-35926 > URL: https://issues.apache.org/jira/browse/SPARK-35926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0 > > > Currently, the width_bucket function supports the argument types [DoubleType, DoubleType, DoubleType, > LongType]; > we hope it will also support [YearMonthIntervalType, YearMonthIntervalType, > YearMonthIntervalType, LongType] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35926) Support YearMonthIntervalType in width-bucket function
[ https://issues.apache.org/jira/browse/SPARK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-35926: Assignee: PengLei > Support YearMonthIntervalType in width-bucket function > -- > > Key: SPARK-35926 > URL: https://issues.apache.org/jira/browse/SPARK-35926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > > Currently, the width_bucket function supports the argument types [DoubleType, DoubleType, DoubleType, > LongType]; > we hope it will also support [YearMonthIntervalType, YearMonthIntervalType, > YearMonthIntervalType, LongType] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
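To make the new behaviour concrete, the query below is the kind of call the change enables, assuming an active {{SparkSession}} named {{spark}} (for example in spark-shell); the exact semantics are defined by the linked pull request, so treat this as illustrative only.

{code:scala}
// Bucket a 3-years-6-months interval into one of 10 equal-width buckets
// spanning 0 to 10 years. With only the DoubleType signature this query
// would fail to resolve; with YearMonthIntervalType support it returns a bucket index.
spark.sql(
  """SELECT width_bucket(
    |  INTERVAL '3-6' YEAR TO MONTH,
    |  INTERVAL '0-0' YEAR TO MONTH,
    |  INTERVAL '10-0' YEAR TO MONTH,
    |  10) AS bucket""".stripMargin).show()
{code}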
[jira] [Commented] (SPARK-37019) Add Codegen support to ArrayTransform
[ https://issues.apache.org/jira/browse/SPARK-37019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429254#comment-17429254 ] Apache Spark commented on SPARK-37019: -- User 'Kimahriman' has created a pull request for this issue: https://github.com/apache/spark/pull/34294 > Add Codegen support to ArrayTransform > - > > Key: SPARK-37019 > URL: https://issues.apache.org/jira/browse/SPARK-37019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Adam Binford >Priority: Major > > Currently all of the higher order functions use CodegenFallback. We can > improve the performance of these by adding proper codegen support, so the > function as well as all children can be codegen'd, and it can participate in > WholeStageCodegen. > This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37019) Add Codegen support to ArrayTransform
[ https://issues.apache.org/jira/browse/SPARK-37019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37019: Assignee: (was: Apache Spark) > Add Codegen support to ArrayTransform > - > > Key: SPARK-37019 > URL: https://issues.apache.org/jira/browse/SPARK-37019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Adam Binford >Priority: Major > > Currently all of the higher order functions use CodegenFallback. We can > improve the performance of these by adding proper codegen support, so the > function as well as all children can be codegen'd, and it can participate in > WholeStageCodegen. > This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37019) Add Codegen support to ArrayTransform
[ https://issues.apache.org/jira/browse/SPARK-37019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37019: Assignee: Apache Spark > Add Codegen support to ArrayTransform > - > > Key: SPARK-37019 > URL: https://issues.apache.org/jira/browse/SPARK-37019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Adam Binford >Assignee: Apache Spark >Priority: Major > > Currently all of the higher order functions use CodegenFallback. We can > improve the performance of these by adding proper codegen support, so the > function as well as all children can be codegen'd, and it can participate in > WholeStageCodegen. > This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37019) Add Codegen support to ArrayTransform
Adam Binford created SPARK-37019: Summary: Add Codegen support to ArrayTransform Key: SPARK-37019 URL: https://issues.apache.org/jira/browse/SPARK-37019 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Adam Binford Currently all of the higher order functions use CodegenFallback. We can improve the performance of these by adding proper codegen support, so the function as well as all children can be codegen'd, and it can participate in WholeStageCodegen. This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
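For readers who have not met it, ArrayTransform is the expression behind the SQL higher-order function {{transform}}. The snippet below, assuming an active {{SparkSession}} named {{spark}}, shows the sort of lambda that currently falls back to interpreted evaluation and that this ticket wants to compile.

{code:scala}
// The x -> x + 1 lambda becomes an ArrayTransform expression. Today it is
// evaluated through CodegenFallback; with proper codegen the generated Java
// code could participate in whole-stage code generation.
val df = spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one")
df.show()  // plus_one: [2, 3, 4]
{code}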
[jira] [Resolved] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-36987. --- Resolution: Not A Problem > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37018) Spark SQL should support create function with Aggregator
[ https://issues.apache.org/jira/browse/SPARK-37018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429198#comment-17429198 ] jiaan.geng commented on SPARK-37018: I'm working on this. > Spark SQL should support create function with Aggregator > > > Key: SPARK-37018 > URL: https://issues.apache.org/jira/browse/SPARK-37018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL does not support creating a function with an Aggregator, and > UserDefinedAggregateFunction is deprecated. > If we remove UserDefinedAggregateFunction, Spark SQL should provide a new > option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37018) Spark SQL should support create function with Aggregator
jiaan.geng created SPARK-37018: -- Summary: Spark SQL should support create function with Aggregator Key: SPARK-37018 URL: https://issues.apache.org/jira/browse/SPARK-37018 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: jiaan.geng Spark SQL does not support creating a function with an Aggregator, and UserDefinedAggregateFunction is deprecated. If we remove UserDefinedAggregateFunction, Spark SQL should provide a new option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
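For reference, an {{Aggregator}} can already be registered from Scala through {{functions.udaf}}; what the ticket asks for is an equivalent of SQL {{CREATE FUNCTION}}. The sketch below shows today's Scala-side path with illustrative names, assuming a local SparkSession.

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// A trivial typed Aggregator that sums longs.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Registration works today, but only through the Scala/Java API,
// not through a SQL CREATE FUNCTION statement.
spark.udf.register("long_sum", udaf(LongSum, Encoders.scalaLong))
spark.sql("SELECT long_sum(id) AS total FROM range(10)").show()  // total: 45
{code}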
[jira] [Updated] (SPARK-37016) Publicise UpperCaseCharStream
[ https://issues.apache.org/jira/browse/SPARK-37016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dohongdayi updated SPARK-37016: --- Description: Many Spark extension projects are copying `UpperCaseCharStream` because it is private beneath `parser` package, such as: [Delta Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290] [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112] [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175] [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31] [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108] [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13] We can publicise `UpperCaseCharStream` to eliminate code duplication. was: Many Spark extension projects are copying `UpperCaseCharStream` because it is private beneath `parser` package, such as: [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112] [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175] [Delta Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290] [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31] [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108] [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13] We can publicise `UpperCaseCharStream` to eliminate code duplication. 
> Publicise UpperCaseCharStream > - > > Key: SPARK-37016 > URL: https://issues.apache.org/jira/browse/SPARK-37016 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.1, 3.1.2, 3.2.0 >Reporter: dohongdayi >Priority: Major > Fix For: 2.4.9, 3.1.3, 3.0.4, 3.2.1, 3.3.0 > > > Many Spark extension projects are copying `UpperCaseCharStream` because it is > private beneath `parser` package, such as: > [Delta > Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290] > [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112] > [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175] > [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31] > [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108] > [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13] > We can publicise `UpperCaseCharStream` to eliminate code duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005)
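For readers who have not seen the class, it is a small ANTLR {{CharStream}} wrapper that upper-cases the characters the lexer reads so that SQL keywords match case-insensitively. The sketch below is a paraphrase of what each project ends up copying, not a verbatim excerpt of Spark's source.

{code:scala}
import org.antlr.v4.runtime.{CharStream, CodePointCharStream, IntStream}
import org.antlr.v4.runtime.misc.Interval

// Paraphrased sketch of the commonly copied wrapper: delegate everything to the
// underlying stream, but report upper-cased code points to the lexer via LA().
class UpperCaseCharStream(wrapped: CodePointCharStream) extends CharStream {
  override def consume(): Unit = wrapped.consume()
  override def getSourceName(): String = wrapped.getSourceName
  override def index(): Int = wrapped.index
  override def mark(): Int = wrapped.mark
  override def release(marker: Int): Unit = wrapped.release(marker)
  override def seek(where: Int): Unit = wrapped.seek(where)
  override def size(): Int = wrapped.size
  override def getText(interval: Interval): String = wrapped.getText(interval)
  override def LA(i: Int): Int = {
    val la = wrapped.LA(i)
    if (la == 0 || la == IntStream.EOF) la else Character.toUpperCase(la)
  }
}
{code}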
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429162#comment-17429162 ] Zhixiong Chen commented on SPARK-37017: --- I have created a pull request for this issue: https://github.com/apache/spark/pull/34292 > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py
[ https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37014: Assignee: Apache Spark > Inline type hints for python/pyspark/streaming/context.py > - > > Key: SPARK-37014 > URL: https://issues.apache.org/jira/browse/SPARK-37014 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py
[ https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37014: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/streaming/context.py > - > > Key: SPARK-37014 > URL: https://issues.apache.org/jira/browse/SPARK-37014 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37017: Assignee: (was: Apache Spark) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py
[ https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429179#comment-17429179 ] Apache Spark commented on SPARK-37014: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34293 > Inline type hints for python/pyspark/streaming/context.py > - > > Key: SPARK-37014 > URL: https://issues.apache.org/jira/browse/SPARK-37014 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429178#comment-17429178 ] Apache Spark commented on SPARK-37017: -- User 'chenzhx' has created a pull request for this issue: https://github.com/apache/spark/pull/34292 > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37017: Assignee: Apache Spark > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Assignee: Apache Spark >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhixiong Chen updated SPARK-37017: -- Comment: was deleted (was: I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292]) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhixiong Chen updated SPARK-37017: -- Comment: was deleted (was: I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292]) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429161#comment-17429161 ] Zhixiong Chen commented on SPARK-37017: --- I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292] > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429160#comment-17429160 ] Zhixiong Chen commented on SPARK-37017: --- I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292] > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
Zhixiong Chen created SPARK-37017: - Summary: Reduce the scope of synchronized to prevent deadlock. Key: SPARK-37017 URL: https://issues.apache.org/jira/browse/SPARK-37017 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1 Reporter: Zhixiong Chen There is a synchronized block in the CatalogManager.currentNamespace function. Sometimes a deadlock occurs. The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
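To illustrate the general pattern being proposed, the sketch below is schematic only and is not the actual {{CatalogManager}} code: any call that may block or take other locks moves outside the {{synchronized}} region, and only access to the shared field stays guarded.

{code:scala}
// Schematic example of narrowing a synchronized block.
class NamespaceCache(loadDefault: () => Array[String]) {
  private var cached: Array[String] = _

  def currentNamespace: Array[String] = {
    val existing = synchronized { cached }
    if (existing != null) {
      existing
    } else {
      // Potentially slow call that may acquire other locks; it no longer runs
      // while holding this object's monitor, which removes the deadlock window.
      val computed = loadDefault()
      synchronized {
        if (cached == null) {
          cached = computed
        }
        cached
      }
    }
  }
}
{code}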