[jira] [Created] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
Ye Zhou created SPARK-37023: --- Summary: Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry Key: SPARK-37023 URL: https://issues.apache.org/jira/browse/SPARK-37023 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.2.0 Reporter: Ye Zhou The assertion below in MapOutputTracker.getMapSizesByExecutorId is not guaranteed to hold: {code:java} assert(mapSizesByExecutorId.enableBatchFetch == true){code} The reason is that in some stage retry cases, shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses still exist because the Driver has already collected the merged statuses for this shuffle dependency. In that case, the current implementation sets enableBatchFetch to false simply because merge statuses are present. Details can be found here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] We should improve the implementation here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
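As a purely illustrative model of the intended behavior (not Spark source code; the names below are made up for the sketch), the decision could look like this: batch fetch should stay enabled when the retried shuffle dependency has push-based merge disabled, even if the driver still holds merge statuses collected earlier.

{code:python}
# Illustrative sketch only -- not the actual MapOutputTracker logic.
def should_enable_batch_fetch(shuffle_merge_enabled: bool,
                              has_merge_statuses: bool) -> bool:
    if not shuffle_merge_enabled:
        # The stage retry disabled push-based merge for this dependency,
        # so any previously collected merge statuses should be ignored
        # and batch fetch should remain enabled.
        return True
    # Merge is enabled: batch fetch is disabled when merge statuses are used.
    return not has_merge_statuses


# The retry case described above: merge disabled, stale statuses present.
assert should_enable_batch_fetch(False, True) is True
assert should_enable_batch_fetch(True, True) is False
{code}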
[jira] [Commented] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429523#comment-17429523 ] Hyukjin Kwon commented on SPARK-37004: -- Just for other people who face this issue from Spark 3.2.0 onwards: the workaround is to set the {{PYSPARK_PIN_THREAD}} environment variable to {{false}}. > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default (SPARK-35303). > However, in a jupyter notebook, after I cancel (interrupt) a long-running > Spark job, the next Spark command will fail with some py4j errors. See > attached notebook for repro. > Cannot reproduce the issue after I turn off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
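For readers hitting this on 3.2.0, a minimal sketch of applying the workaround in a notebook follows. The environment variable has to be set before the Py4J gateway (and thus the SparkSession) is created; the master and app name below are placeholders, adjust them for your environment.

{code:python}
import os

# Must run before any PySpark code starts the JVM gateway, otherwise the
# pinned-thread setting is not picked up.
os.environ["PYSPARK_PIN_THREAD"] = "false"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                # placeholder for illustration
    .appName("pin-thread-workaround")  # placeholder for illustration
    .getOrCreate()
)
{code}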
[jira] [Commented] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429519#comment-17429519 ] Apache Spark commented on SPARK-36232: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34299 > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
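Until this is fixed, one possible interim workaround (an assumption, not an official recommendation) is to normalize {{Decimal('NaN')}} values to {{None}} on the Python side before building the Series, since {{None}} round-trips without the pickle error:

{code:python}
import decimal as d

import pyspark.pandas as ps

data = [d.Decimal("1.0"), d.Decimal("2.0"), d.Decimal("NaN")]

# Decimal exposes is_nan() for exactly this kind of check; replace NaN
# decimals with None before handing the data to pandas-on-Spark.
cleaned = [None if x.is_nan() else x for x in data]

psser = ps.Series(cleaned)
print(psser)
{code}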
[jira] [Commented] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429518#comment-17429518 ] Apache Spark commented on SPARK-36232: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34299 > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36230: Assignee: (was: Apache Spark) > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
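For context, plain pandas treats {{Decimal('NaN')}} as a missing value, which is the behavior pandas-on-Spark is expected to match here (a quick check with pandas; exact output can vary slightly across pandas versions):

{code:python}
from decimal import Decimal

import pandas as pd

pser = pd.Series([Decimal("0.1"), Decimal("NaN")])

# pandas recognizes Decimal('NaN') as missing, so hasnans is True;
# pandas-on-Spark currently converts the value to None and reports False.
print(pd.isna(Decimal("NaN")))  # True
print(pser.hasnans)             # True (expected behavior)
{code}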
[jira] [Commented] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429517#comment-17429517 ] Apache Spark commented on SPARK-36230: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34299 > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36232: Assignee: (was: Apache Spark) > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36230: Assignee: Apache Spark > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36232: Assignee: Apache Spark > Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled > > > Key: SPARK-36232 > URL: https://issues.apache.org/jira/browse/SPARK-36232 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > > {code:java} > >>> import decimal as d > >>> import pyspark.pandas as ps > >>> import numpy as np > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> True) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 0 1 > 1 2 > 2None > dtype: object > >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled', > >>> False) > >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)]) > 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51) > net.razorvine.pickle.PickleException: problem construction object: > java.lang.reflect.InvocationTargetException > ... > {code} > As the code is shown above, we cannot create a Series with `Decimal('NaN')` > when Arrow disabled. We ought to fix that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36230: Assignee: (was: Apache Spark) > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429515#comment-17429515 ] Yikun Jiang commented on SPARK-36231: - working on this > Support arithmetic operations of Series containing Decimal(np.nan) > --- > > Key: SPARK-36231 > URL: https://issues.apache.org/jira/browse/SPARK-36231 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Arithmetic operations of Series containing Decimal(np.nan) raise > java.lang.NullPointerException in driver. An example is shown as below: > {code:java} > >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), > >>> decimal.Decimal(np.nan)]) > >>> psser = ps.from_pandas(pser) > >>> pser + 1 > 0 2 > 1 3 > 2 NaN > >>> psser + 1 > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629) > at > 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107) > at > org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.s
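A possible interim workaround (an assumption, not a verified fix) is to map {{Decimal('NaN')}} to {{None}} on the pandas side before calling {{from_pandas}}, so the arithmetic never sees the NaN decimal that triggers the NullPointerException:

{code:python}
import decimal

import numpy as np
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series(
    [decimal.Decimal(1.0), decimal.Decimal(2.0), decimal.Decimal(np.nan)]
)

# Normalize Decimal('NaN') to None before handing the data to pandas-on-Spark.
cleaned = pser.map(
    lambda x: None if isinstance(x, decimal.Decimal) and x.is_nan() else x
)

psser = ps.from_pandas(cleaned)
# With the NaN decimal removed, the operation may follow the ordinary
# missing-value path instead of raising in the JVM.
print(psser + 1)
{code}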
[jira] [Commented] (SPARK-36230) hasnans for Series of Decimal(`NaN`)
[ https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429514#comment-17429514 ] Yikun Jiang commented on SPARK-36230: - working on this > hasnans for Series of Decimal(`NaN`) > > > Key: SPARK-36230 > URL: https://issues.apache.org/jira/browse/SPARK-36230 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > {code:java} > >>> import pandas as pd > >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')]) > >>> pser > 00.1 > 1NaN > dtype: object > >>> psser = ps.from_pandas(pser) > >>> psser > 0 0.1 > 1None > dtype: object > >>> psser.hasnans > False > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC
[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34960: Assignee: Apache Spark > Aggregate (Min/Max/Count) push down for ORC > --- > > Key: SPARK-34960 > URL: https://issues.apache.org/jira/browse/SPARK-34960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Minor > > Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we > can also push down certain aggregations into ORC. ORC exposes column > statistics in interface `org.apache.orc.Reader` > ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118] > ), where Spark can utilize for aggregation push down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
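The queries that benefit are ungrouped MIN/MAX/COUNT aggregates, which ORC file- and stripe-level column statistics can answer without scanning row data. Below is a hedged sketch of how this might be exercised from PySpark once the feature lands; the config name (mirroring the Parquet one), the path and the column name are assumptions, not confirmed API:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumption: an opt-in flag analogous to the Parquet aggregate push down
# config; check the final PR for the actual option name.
spark.conf.set("spark.sql.orc.aggregatePushdown", "true")

df = spark.read.orc("/path/to/orc/table")  # hypothetical path

# Min/Max/Count without GROUP BY are the aggregates that can be served
# from ORC column statistics.
df.agg(F.min("price"), F.max("price"), F.count("price")).show()
{code}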
[jira] [Assigned] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC
[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34960: Assignee: (was: Apache Spark) > Aggregate (Min/Max/Count) push down for ORC > --- > > Key: SPARK-34960 > URL: https://issues.apache.org/jira/browse/SPARK-34960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we > can also push down certain aggregations into ORC. ORC exposes column > statistics in interface `org.apache.orc.Reader` > ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118] > ), where Spark can utilize for aggregation push down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC
[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429513#comment-17429513 ] Apache Spark commented on SPARK-34960: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/34298 > Aggregate (Min/Max/Count) push down for ORC > --- > > Key: SPARK-34960 > URL: https://issues.apache.org/jira/browse/SPARK-34960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Minor > > Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we > can also push down certain aggregations into ORC. ORC exposes column > statistics in interface `org.apache.orc.Reader` > ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118] > ), where Spark can utilize for aggregation push down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37017: - Issue Type: Bug (was: Improvement) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
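As a general illustration of the pattern (a Python sketch, not the actual Scala CatalogManager code): hold the lock only while reading or writing the shared state, and move any call that may itself take other locks outside the critical section.

{code:python}
import threading

_lock = threading.Lock()
_current_namespace = None


def current_namespace(load_default):
    """Return the cached namespace, computing it lazily outside the lock."""
    global _current_namespace
    # Narrow critical section: only the shared-state read is guarded.
    with _lock:
        cached = _current_namespace
    if cached is not None:
        return cached
    # The potentially slow, lock-taking call runs outside the lock,
    # which avoids the lock-ordering deadlock described above.
    default = load_default()
    with _lock:
        if _current_namespace is None:
            _current_namespace = default
        return _current_namespace
{code}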
[jira] [Assigned] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37022: Assignee: Apache Spark > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37022: Assignee: (was: Apache Spark) > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429491#comment-17429491 ] Apache Spark commented on SPARK-37022: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34297 > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429482#comment-17429482 ] Maciej Szymkiewicz commented on SPARK-37022: cc [~hyukjin.kwon] > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429477#comment-17429477 ] Maciej Szymkiewicz edited comment on SPARK-37022 at 10/15/21, 8:45 PM: --- The attached files show git diff stats for a given configuration. I'll open a draft PR soon, to better visualize the extent of required changes (https://github.com/apache/spark/pull/34297) was (Author: zero323): The attached files show git diff stats for a given configuration. I'll open a draft PR soon, to better visualize the extent of required changes. > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: black-diff-stats.txt > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: (was: black-diff-stats.txt) > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429477#comment-17429477 ] Maciej Szymkiewicz commented on SPARK-37022: The attached files show git diff stats for a given configuration. I'll open a draft PR soon, to better visualize the extent of required changes. > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: black-diff-stats.txt > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37022: --- Attachment: pyproject.toml > Use black as a formatter for the whole PySpark codebase. > > > Key: SPARK-37022 > URL: https://issues.apache.org/jira/browse/SPARK-37022 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Attachments: black-diff-stats.txt, pyproject.toml > > > [{{black}}|https://github.com/psf/black] is a popular Python code formatter. > It is used by a number of projects, both small and large, including prominent > ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used > to format a {{pyspark.pandas}} and (though not enforced) stubs files. > We should consider using black to enforce formatting of all PySpark files. > There are multiple reasons to do that: > - Consistency: black is already used across existing codebase and black > formatted chunks of code are already added to modules other than > pyspark.pandas as a result of type hints inlining (SPARK-36845). > - Lower cost of contributing and reviewing: Formatting can be automatically > enforced and applied. > - Simplify reviews: In general, black formatted code, produces small and > highly readable diffs. > Risks: > - Initial reformatting requires quite significant changes. > - Applying black will break blame in GitHub UI (for git in general see > [Avoiding ruining git > blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). > Additional steps: > - To simplify backporting, black will have to be applied to all active > branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
Maciej Szymkiewicz created SPARK-37022: -- Summary: Use black as a formatter for the whole PySpark codebase. Key: SPARK-37022 URL: https://issues.apache.org/jira/browse/SPARK-37022 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz [{{black}}|https://github.com/psf/black] is a popular Python code formatter. It is used by a number of projects, both small and large, including prominent ones such as pandas, scikit-learn, Django and SQLAlchemy. Black is already used to format the {{pyspark.pandas}} module and (though not enforced) the stub files. We should consider using black to enforce formatting of all PySpark files. There are multiple reasons to do that: - Consistency: black is already used across the existing codebase, and black-formatted chunks of code are already being added to modules other than pyspark.pandas as a result of type hint inlining (SPARK-36845). - Lower cost of contributing and reviewing: formatting can be automatically enforced and applied. - Simpler reviews: in general, black-formatted code produces small and highly readable diffs. Risks: - Initial reformatting requires quite significant changes. - Applying black will break blame in the GitHub UI (for git in general see [Avoiding ruining git blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]). Additional steps: - To simplify backporting, black will have to be applied to all active branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
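To give a feel for the kind of diffs black produces (an illustrative snippet with made-up names, not code from the PySpark tree), black mostly normalizes spacing and quoting without changing behavior:

{code:python}
# Before black (hand-formatted):
def infer_schema( rows,names ):
    if len(rows)==0:
        raise ValueError( 'rows must not be empty' )
    return dict(zip(names,rows[0]))


# After black: normalized spacing around commas, operators and parentheses,
# and double quotes for string literals.
def infer_schema(rows, names):
    if len(rows) == 0:
        raise ValueError("rows must not be empty")
    return dict(zip(names, rows[0]))
{code}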
[jira] [Resolved] (SPARK-36910) Inline type hints for python/pyspark/sql/types.py
[ https://issues.apache.org/jira/browse/SPARK-36910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36910. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34174 https://github.com/apache/spark/pull/34174 > Inline type hints for python/pyspark/sql/types.py > - > > Key: SPARK-36910 > URL: https://issues.apache.org/jira/browse/SPARK-36910 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/types.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
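For readers unfamiliar with the effort: "inlining" means moving annotations that previously lived in separate {{.pyi}} stub files into the {{.py}} modules themselves. A simplified re-creation (not the exact upstream code) of what that looks like for a data type class:

{code:python}
# Previously, the annotation lived only in a stub file (types.pyi):
#     class DataType:
#         def needConversion(self) -> bool: ...
#
# After inlining, the annotation sits directly on the implementation, so the
# signature and the code are type-checked together and cannot drift apart.
from typing import Any


class DataType:
    def needConversion(self) -> bool:
        return False

    def toInternal(self, obj: Any) -> Any:
        return obj
{code}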
[jira] [Resolved] (SPARK-36991) Inline type hints for spark/python/pyspark/sql/streaming.py
[ https://issues.apache.org/jira/browse/SPARK-36991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36991. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34277 https://github.com/apache/spark/pull/34277 > Inline type hints for spark/python/pyspark/sql/streaming.py > --- > > Key: SPARK-36991 > URL: https://issues.apache.org/jira/browse/SPARK-36991 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for spark/python/pyspark/sql/streaming.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37020: Assignee: (was: Apache Spark) > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429396#comment-17429396 ] Apache Spark commented on SPARK-37020: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34291 > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429394#comment-17429394 ] Apache Spark commented on SPARK-36989: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34296 > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
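To illustrate what such data tests check (a schematic example; the real mypy data-driven test format and its output matchers look different), the idea is to assert both the types mypy infers and the errors it reports for small snippets:

{code:python}
# Intended to be run through mypy; the comments show the expected output a
# data test would match against.
from typing import reveal_type  # Python 3.11+; use typing_extensions on older versions

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# A data test asserts the type mypy infers for an expression...
reveal_type(F.col("id"))  # expected note: Revealed type is "pyspark.sql.column.Column"

# ...and that clearly wrong code is rejected, e.g. (kept as a comment so the
# snippet still runs):
# df.withColumn("x", 1)   # expected error: argument has incompatible type "int"
{code}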
[jira] [Assigned] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37020: Assignee: Apache Spark > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36989: Assignee: Apache Spark > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36989: Assignee: (was: Apache Spark) > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37020) Limit push down in DS V2
[ https://issues.apache.org/jira/browse/SPARK-37020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429395#comment-17429395 ] Apache Spark commented on SPARK-37020: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34291 > Limit push down in DS V2 > > > Key: SPARK-37020 > URL: https://issues.apache.org/jira/browse/SPARK-37020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37021) JDBC option "sessionInitStatement" does not execute set sql statement when resolving a table
[ https://issues.apache.org/jira/browse/SPARK-37021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valery Meleshkin updated SPARK-37021: - Description: If {{sessionInitStatement}} is required to grant permissions or resolve an ambiguity, schema resolution will fail when reading a JDBC table. Consider the following example running against Oracle database: {code:scala} reader.format("jdbc").options( Map( "url" -> jdbcUrl, "dbtable" -> "SELECT * FROM FOO", "user" -> "BOB", "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR, "password" -> password )).load {code} Table {{FOO}} is in schema {{BAR}}, but default value for {{CURRENT_SCHEMA}} for the JDBC connection will be {{BOB}}. Therefore, the code above will fail with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It happens because [resolveTable |https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67] that is called during planning phase ignores {{sessionInitStatement}}. was: If {{sessionInitStatement}} is required to grant permissions or resolve an ambiguity, schema resolution will fail when reading a JDBC table. Consider the following example running against Oracle database: {code:scala} reader.format("jdbc").options( Map( "url" -> jdbcUrl, "dbtable" -> "SELECT * FROM FOO", "user" -> "BOB", "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR, "password" -> password )).load {code} Table {{FOO}} is in schema {{BAR}}, but default value for {{CURRENT_SCHEMA}} for the JDBC connection will be {{BOB}}. Therefore, the code above will fail with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It happens because [resolveTable |resolveTable] that is called during planning phase ignores `sessionInitStatement`. > JDBC option "sessionInitStatement" does not execute set sql statement when > resolving a table > > > Key: SPARK-37021 > URL: https://issues.apache.org/jira/browse/SPARK-37021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.2 >Reporter: Valery Meleshkin >Priority: Major > > If {{sessionInitStatement}} is required to grant permissions or resolve an > ambiguity, schema resolution will fail when reading a JDBC table. > Consider the following example running against Oracle database: > {code:scala} > reader.format("jdbc").options( > Map( > "url" -> jdbcUrl, > "dbtable" -> "SELECT * FROM FOO", > "user" -> "BOB", > "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR, > "password" -> password > )).load > {code} > Table {{FOO}} is in schema {{BAR}}, but default value for {{CURRENT_SCHEMA}} > for the JDBC connection will be {{BOB}}. Therefore, the code above will fail > with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). > It happens because [resolveTable > |https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67] > that is called during planning phase ignores {{sessionInitStatement}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
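The same options expressed through the Python API hit the same problem. Until {{sessionInitStatement}} is honored during schema resolution, one workaround sketch (an assumption about the reporter's setup, not a verified fix) is to qualify the table with its schema so that resolution does not depend on {{CURRENT_SCHEMA}}:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders for the reporter's Oracle connection details.
jdbc_url = "jdbc:oracle:thin:@//host:1521/service"
password = "..."

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("user", "BOB")
    .option("password", password)
    # Qualifying the table avoids relying on the ALTER SESSION statement
    # while resolveTable still ignores sessionInitStatement.
    .option("dbtable", "BAR.FOO")
    .load()
)
{code}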
[jira] [Created] (SPARK-37021) JDBC option "sessionInitStatement" does not execute set sql statement when resolving a table
Valery Meleshkin created SPARK-37021: Summary: JDBC option "sessionInitStatement" does not execute set sql statement when resolving a table Key: SPARK-37021 URL: https://issues.apache.org/jira/browse/SPARK-37021 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.2 Reporter: Valery Meleshkin If {{sessionInitStatement}} is required to grant permissions or resolve an ambiguity, schema resolution will fail when reading a JDBC table. Consider the following example running against an Oracle database: {code:scala} reader.format("jdbc").options( Map( "url" -> jdbcUrl, "dbtable" -> "SELECT * FROM FOO", "user" -> "BOB", "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR"""", "password" -> password )).load {code} Table {{FOO}} is in schema {{BAR}}, but the default value of {{CURRENT_SCHEMA}} for the JDBC connection will be {{BOB}}. Therefore, the code above will fail with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It happens because [resolveTable|https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67], which is called during the planning phase, ignores {{sessionInitStatement}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
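For context, the failure mode above is that Spark asks the database for the query's schema before the session init statement has ever run. The sketch below only illustrates the expected ordering; {{resolveColumnsWithInit}} is a hypothetical helper, not the actual {{JDBCRDD.resolveTable}} implementation, and the real fix may look quite different.

{code:scala}
import java.sql.Connection

// Hypothetical sketch: execute the user-supplied sessionInitStatement on the
// connection *before* resolving the schema, so that statements such as
// ALTER SESSION SET CURRENT_SCHEMA also take effect during planning.
def resolveColumnsWithInit(
    conn: Connection,
    options: Map[String, String],
    query: String): Seq[String] = {
  options.get("sessionInitStatement").foreach { initSql =>
    val stmt = conn.createStatement()
    try stmt.execute(initSql) finally stmt.close()
  }
  // "WHERE 1=0" returns no rows but still exposes the result metadata.
  val ps = conn.prepareStatement(s"SELECT * FROM ($query) t WHERE 1=0")
  try {
    val md = ps.executeQuery().getMetaData
    (1 to md.getColumnCount).map(md.getColumnName)
  } finally {
    ps.close()
  }
}
{code}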
[jira] [Created] (SPARK-37020) Limit push down in DS V2
Huaxin Gao created SPARK-37020: -- Summary: Limit push down in DS V2 Key: SPARK-37020 URL: https://issues.apache.org/jira/browse/SPARK-37020 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Huaxin Gao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36276) Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43
[ https://issues.apache.org/jira/browse/SPARK-36276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429343#comment-17429343 ] Apache Spark commented on SPARK-36276: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34295 > Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43 > -- > > Key: SPARK-36276 > URL: https://issues.apache.org/jira/browse/SPARK-36276 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36276) Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43
[ https://issues.apache.org/jira/browse/SPARK-36276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429341#comment-17429341 ] Apache Spark commented on SPARK-36276: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34295 > Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43 > -- > > Key: SPARK-36276 > URL: https://issues.apache.org/jira/browse/SPARK-36276 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35926) Support YearMonthIntervalType in width-bucket function
[ https://issues.apache.org/jira/browse/SPARK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35926. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33132 [https://github.com/apache/spark/pull/33132] > Support YearMonthIntervalType in width-bucket function > -- > > Key: SPARK-35926 > URL: https://issues.apache.org/jira/browse/SPARK-35926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0 > > > Currently, the width_bucket function supports the argument types [DoubleType, DoubleType, DoubleType, > LongType]; > we hope it will also support [YearMonthIntervalType, YearMonthIntervalType, > YearMonthIntervalType, LongType] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35926) Support YearMonthIntervalType in width-bucket function
[ https://issues.apache.org/jira/browse/SPARK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-35926: Assignee: PengLei > Support YearMonthIntervalType in width-bucket function > -- > > Key: SPARK-35926 > URL: https://issues.apache.org/jira/browse/SPARK-35926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > > Currently, the width_bucket function supports the argument types [DoubleType, DoubleType, DoubleType, > LongType]; > we hope it will also support [YearMonthIntervalType, YearMonthIntervalType, > YearMonthIntervalType, LongType] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
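To make the new behaviour concrete, the query below is the kind of call the change enables, assuming an active {{SparkSession}} named {{spark}} (for example in spark-shell); the exact semantics are defined by the linked pull request, so treat this as illustrative only.

{code:scala}
// Bucket a 3-years-6-months interval into one of 10 equal-width buckets
// spanning 0 to 10 years. With only the DoubleType signature this query
// would fail to resolve; with YearMonthIntervalType support it returns a bucket index.
spark.sql(
  """SELECT width_bucket(
    |  INTERVAL '3-6' YEAR TO MONTH,
    |  INTERVAL '0-0' YEAR TO MONTH,
    |  INTERVAL '10-0' YEAR TO MONTH,
    |  10) AS bucket""".stripMargin).show()
{code}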
[jira] [Commented] (SPARK-37019) Add Codegen support to ArrayTransform
[ https://issues.apache.org/jira/browse/SPARK-37019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429254#comment-17429254 ] Apache Spark commented on SPARK-37019: -- User 'Kimahriman' has created a pull request for this issue: https://github.com/apache/spark/pull/34294 > Add Codegen support to ArrayTransform > - > > Key: SPARK-37019 > URL: https://issues.apache.org/jira/browse/SPARK-37019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Adam Binford >Priority: Major > > Currently all of the higher order functions use CodegenFallback. We can > improve the performance of these by adding proper codegen support, so the > function as well as all children can be codegen'd, and it can participate in > WholeStageCodegen. > This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37019) Add Codegen support to ArrayTransform
[ https://issues.apache.org/jira/browse/SPARK-37019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37019: Assignee: (was: Apache Spark) > Add Codegen support to ArrayTransform > - > > Key: SPARK-37019 > URL: https://issues.apache.org/jira/browse/SPARK-37019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Adam Binford >Priority: Major > > Currently all of the higher order functions use CodegenFallback. We can > improve the performance of these by adding proper codegen support, so the > function as well as all children can be codegen'd, and it can participate in > WholeStageCodegen. > This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37019) Add Codegen support to ArrayTransform
[ https://issues.apache.org/jira/browse/SPARK-37019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37019: Assignee: Apache Spark > Add Codegen support to ArrayTransform > - > > Key: SPARK-37019 > URL: https://issues.apache.org/jira/browse/SPARK-37019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Adam Binford >Assignee: Apache Spark >Priority: Major > > Currently all of the higher order functions use CodegenFallback. We can > improve the performance of these by adding proper codegen support, so the > function as well as all children can be codegen'd, and it can participate in > WholeStageCodegen. > This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37019) Add Codegen support to ArrayTransform
Adam Binford created SPARK-37019: Summary: Add Codegen support to ArrayTransform Key: SPARK-37019 URL: https://issues.apache.org/jira/browse/SPARK-37019 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Adam Binford Currently all of the higher order functions use CodegenFallback. We can improve the performance of these by adding proper codegen support, so the function as well as all children can be codegen'd, and it can participate in WholeStageCodegen. This ticket is for adding support to ArrayTransform as the first step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
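For readers who have not met it, ArrayTransform is the expression behind the SQL higher-order function {{transform}}. The snippet below, assuming an active {{SparkSession}} named {{spark}}, shows the sort of lambda that currently falls back to interpreted evaluation and that this ticket wants to compile.

{code:scala}
// The x -> x + 1 lambda becomes an ArrayTransform expression. Today it is
// evaluated through CodegenFallback; with proper codegen the generated Java
// code could participate in whole-stage code generation.
val df = spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one")
df.show()  // plus_one: [2, 3, 4]
{code}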
[jira] [Resolved] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-36987. --- Resolution: Not A Problem > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37018) Spark SQL should support create function with Aggregator
[ https://issues.apache.org/jira/browse/SPARK-37018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429198#comment-17429198 ] jiaan.geng commented on SPARK-37018: I'm working on this. > Spark SQL should support create function with Aggregator > > > Key: SPARK-37018 > URL: https://issues.apache.org/jira/browse/SPARK-37018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL does not support creating a function with an Aggregator, and > UserDefinedAggregateFunction is deprecated. > If we remove UserDefinedAggregateFunction, Spark SQL should provide a new > option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37018) Spark SQL should support create function with Aggregator
jiaan.geng created SPARK-37018: -- Summary: Spark SQL should support create function with Aggregator Key: SPARK-37018 URL: https://issues.apache.org/jira/browse/SPARK-37018 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: jiaan.geng Spark SQL does not support creating a function with an Aggregator, and UserDefinedAggregateFunction is deprecated. If we remove UserDefinedAggregateFunction, Spark SQL should provide a new option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
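For reference, an {{Aggregator}} can already be registered from Scala through {{functions.udaf}}; what the ticket asks for is an equivalent of SQL {{CREATE FUNCTION}}. The sketch below shows today's Scala-side path with illustrative names, assuming a local SparkSession.

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// A trivial typed Aggregator that sums longs.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Registration works today, but only through the Scala/Java API,
// not through a SQL CREATE FUNCTION statement.
spark.udf.register("long_sum", udaf(LongSum, Encoders.scalaLong))
spark.sql("SELECT long_sum(id) AS total FROM range(10)").show()  // total: 45
{code}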
[jira] [Updated] (SPARK-37016) Publicise UpperCaseCharStream
[ https://issues.apache.org/jira/browse/SPARK-37016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dohongdayi updated SPARK-37016: --- Description: Many Spark extension projects are copying `UpperCaseCharStream` because it is private beneath `parser` package, such as: [Delta Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290] [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112] [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175] [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31] [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108] [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13] We can publicise `UpperCaseCharStream` to eliminate code duplication. was: Many Spark extension projects are copying `UpperCaseCharStream` because it is private beneath `parser` package, such as: [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112] [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175] [Delta Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290] [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31] [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108] [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13] We can publicise `UpperCaseCharStream` to eliminate code duplication. 
> Publicise UpperCaseCharStream > - > > Key: SPARK-37016 > URL: https://issues.apache.org/jira/browse/SPARK-37016 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.1, 3.1.2, 3.2.0 >Reporter: dohongdayi >Priority: Major > Fix For: 2.4.9, 3.1.3, 3.0.4, 3.2.1, 3.3.0 > > > Many Spark extension projects are copying `UpperCaseCharStream` because it is > private beneath `parser` package, such as: > [Delta > Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290] > [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112] > [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175] > [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31] > [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108] > [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13] > We can publicise `UpperCaseCharStream` to eliminate code duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005)
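For readers who have not seen the class, it is a small ANTLR {{CharStream}} wrapper that upper-cases the characters the lexer reads so that SQL keywords match case-insensitively. The sketch below is a paraphrase of what each project ends up copying, not a verbatim excerpt of Spark's source.

{code:scala}
import org.antlr.v4.runtime.{CharStream, CodePointCharStream, IntStream}
import org.antlr.v4.runtime.misc.Interval

// Paraphrased sketch of the commonly copied wrapper: delegate everything to the
// underlying stream, but report upper-cased code points to the lexer via LA().
class UpperCaseCharStream(wrapped: CodePointCharStream) extends CharStream {
  override def consume(): Unit = wrapped.consume()
  override def getSourceName(): String = wrapped.getSourceName
  override def index(): Int = wrapped.index
  override def mark(): Int = wrapped.mark
  override def release(marker: Int): Unit = wrapped.release(marker)
  override def seek(where: Int): Unit = wrapped.seek(where)
  override def size(): Int = wrapped.size
  override def getText(interval: Interval): String = wrapped.getText(interval)
  override def LA(i: Int): Int = {
    val la = wrapped.LA(i)
    if (la == 0 || la == IntStream.EOF) la else Character.toUpperCase(la)
  }
}
{code}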
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429162#comment-17429162 ] Zhixiong Chen commented on SPARK-37017: --- I have created a pull request for this issue: https://github.com/apache/spark/pull/34292 > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py
[ https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37014: Assignee: Apache Spark > Inline type hints for python/pyspark/streaming/context.py > - > > Key: SPARK-37014 > URL: https://issues.apache.org/jira/browse/SPARK-37014 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py
[ https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37014: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/streaming/context.py > - > > Key: SPARK-37014 > URL: https://issues.apache.org/jira/browse/SPARK-37014 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37017: Assignee: (was: Apache Spark) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37014) Inline type hints for python/pyspark/streaming/context.py
[ https://issues.apache.org/jira/browse/SPARK-37014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429179#comment-17429179 ] Apache Spark commented on SPARK-37014: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34293 > Inline type hints for python/pyspark/streaming/context.py > - > > Key: SPARK-37014 > URL: https://issues.apache.org/jira/browse/SPARK-37014 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429178#comment-17429178 ] Apache Spark commented on SPARK-37017: -- User 'chenzhx' has created a pull request for this issue: https://github.com/apache/spark/pull/34292 > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37017: Assignee: Apache Spark > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Assignee: Apache Spark >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhixiong Chen updated SPARK-37017: -- Comment: was deleted (was: I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292]) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhixiong Chen updated SPARK-37017: -- Comment: was deleted (was: I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292]) > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429161#comment-17429161 ] Zhixiong Chen commented on SPARK-37017: --- I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292] > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
[ https://issues.apache.org/jira/browse/SPARK-37017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429160#comment-17429160 ] Zhixiong Chen commented on SPARK-37017: --- I have created a pull request for this issue: [https://github.com/apache/spark/pull/34292] > Reduce the scope of synchronized to prevent deadlock. > - > > Key: SPARK-37017 > URL: https://issues.apache.org/jira/browse/SPARK-37017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Zhixiong Chen >Priority: Minor > > There is a synchronized block in the CatalogManager.currentNamespace function. > Sometimes a deadlock occurs. > The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37017) Reduce the scope of synchronized to prevent deadlock.
Zhixiong Chen created SPARK-37017: - Summary: Reduce the scope of synchronized to prevent deadlock. Key: SPARK-37017 URL: https://issues.apache.org/jira/browse/SPARK-37017 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1 Reporter: Zhixiong Chen There is a synchronized block in the CatalogManager.currentNamespace function. Sometimes a deadlock occurs. The scope of the synchronized block can be reduced to prevent the deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
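To illustrate the general pattern being proposed, the sketch below is schematic only and is not the actual {{CatalogManager}} code: any call that may block or take other locks moves outside the {{synchronized}} region, and only access to the shared field stays guarded.

{code:scala}
// Schematic example of narrowing a synchronized block.
class NamespaceCache(loadDefault: () => Array[String]) {
  private var cached: Array[String] = _

  def currentNamespace: Array[String] = {
    val existing = synchronized { cached }
    if (existing != null) {
      existing
    } else {
      // Potentially slow call that may acquire other locks; it no longer runs
      // while holding this object's monitor, which removes the deadlock window.
      val computed = loadDefault()
      synchronized {
        if (cached == null) {
          cached = computed
        }
        cached
      }
    }
  }
}
{code}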