[jira] [Commented] (SPARK-47927) Nullability after join not respected in UDF

2024-06-25 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859940#comment-17859940
 ] 

GridGain Integration commented on SPARK-47927:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/47081

> Nullability after join not respected in UDF
> ---
>
> Key: SPARK-47927
> URL: https://issues.apache.org/jira/browse/SPARK-47927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> {code:java}
> val ds1 = Seq(1).toDS()
> val ds2 = Seq[Int]().toDS()
> val f = udf[(Int, Option[Int]), (Int, Option[Int])](identity)
> ds1.join(ds2, ds1("value") === ds2("value"), 
> "outer").select(f(struct(ds1("value"), ds2("value")))).show()
> ds1.join(ds2, ds1("value") === ds2("value"), 
> "outer").select(struct(ds1("value"), ds2("value"))).show() {code}
> outputs
> {code:java}
> +---------------------------------------+
> |UDF(struct(value, value, value, value))|
> +---------------------------------------+
> |                                 {1, 0}|
> +---------------------------------------+
> +--------------------+
> |struct(value, value)|
> +--------------------+
> |           {1, NULL}|
> +--------------------+ {code}
> So when the result is passed to the UDF, the nullability after the join is 
> not respected and we incorrectly end up with a 0 value instead of a null/None 
> value.
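The shape of this bug can be illustrated without Spark. The following is a hedged sketch (hypothetical `decodeToPrimitive`/`decodeToOption` helpers, not Spark's actual deserializer codegen): a primitive `int` target has no representation for SQL NULL, so the value silently collapses to the type's default (0), whereas an `Option`-like target preserves it.

```java
import java.util.Optional;

public class NullToZeroDemo {
    // Hypothetical decoder targeting a primitive field: NULL has no
    // representation, so it silently becomes the default value 0.
    public static int decodeToPrimitive(Integer cell) {
        return cell == null ? 0 : cell;
    }

    // Hypothetical decoder targeting an optional field: NULL survives
    // as an empty Optional, matching the expected {1, NULL} output.
    public static Optional<Integer> decodeToOption(Integer cell) {
        return Optional.ofNullable(cell);
    }

    public static void main(String[] args) {
        System.out.println(decodeToPrimitive(null)); // prints 0
        System.out.println(decodeToOption(null));    // prints Optional.empty
    }
}
```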



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45982) re-org R package installations

2023-11-20 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787878#comment-17787878
 ] 

GridGain Integration commented on SPARK-45982:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/43904

> re-org R package installations
> --
>
> Key: SPARK-45982
> URL: https://issues.apache.org/jira/browse/SPARK-45982
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>










[jira] [Commented] (SPARK-45392) Replace `Class.newInstance()` with `Class.getDeclaredConstructor().newInstance()`

2023-10-01 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770868#comment-17770868
 ] 

GridGain Integration commented on SPARK-45392:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/43193

> Replace `Class.newInstance()` with 
> `Class.getDeclaredConstructor().newInstance()`
> -
>
> Key: SPARK-45392
> URL: https://issues.apache.org/jira/browse/SPARK-45392
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
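The practical difference between the two reflection calls can be sketched with plain JDK code (a minimal illustration, not Spark's migration itself): `Class.newInstance()` was deprecated in Java 9 because it propagates checked exceptions from the constructor directly and undeclared, while `getDeclaredConstructor().newInstance()` wraps them in `InvocationTargetException`, letting callers distinguish construction failure from reflection failure.

```java
import java.io.IOException;

public class NewInstanceDemo {
    public static class Thrower {
        public Thrower() throws IOException {
            throw new IOException("ctor failed");
        }
    }

    public static String viaNewInstance() {
        try {
            Thrower.class.newInstance(); // deprecated since Java 9
            return "ok";
        } catch (Exception e) {
            // The constructor's checked IOException surfaces directly,
            // even though newInstance() never declares it.
            return e.getClass().getSimpleName();
        }
    }

    public static String viaGetDeclaredConstructor() {
        try {
            Thrower.class.getDeclaredConstructor().newInstance();
            return "ok";
        } catch (Exception e) {
            // The same failure arrives wrapped in InvocationTargetException.
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(viaNewInstance());            // IOException
        System.out.println(viaGetDeclaredConstructor()); // InvocationTargetException
    }
}
```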







[jira] [Commented] (SPARK-45014) Clean up fileserver when cleaning up files, jars and archives in SparkContext

2023-08-30 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760658#comment-17760658
 ] 

GridGain Integration commented on SPARK-45014:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42731

> Clean up fileserver when cleaning up files, jars and archives in SparkContext
> -
>
> Key: SPARK-45014
> URL: https://issues.apache.org/jira/browse/SPARK-45014
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In SPARK-44348, we clean up the Spark Context's added files, but we don't 
> clean up the ones in the fileserver.






[jira] [Commented] (SPARK-44931) Fix JSON Serialization for Spark Connect Event Listener

2023-08-23 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758107#comment-17758107
 ] 

GridGain Integration commented on SPARK-44931:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/42630

> Fix JSON Serialization for Spark Connect Event Listener
> ---
>
> Key: SPARK-44931
> URL: https://issues.apache.org/jira/browse/SPARK-44931
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Priority: Major
>







[jira] [Commented] (SPARK-44906) Move substituteAppNExecIds logic into kubernetesConf.annotations method

2023-08-22 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757238#comment-17757238
 ] 

GridGain Integration commented on SPARK-44906:
--

User 'zwangsheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42600

> Move substituteAppNExecIds logic into kubernetesConf.annotations method 
> 
>
> Key: SPARK-44906
> URL: https://issues.apache.org/jira/browse/SPARK-44906
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.1
>Reporter: Binjie Yang
>Priority: Major
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations 
> as the default behavior, so users can reuse it instead of reimplementing the 
> same logic.






[jira] [Commented] (SPARK-44795) CodeGenCache should be ClassLoader specific

2023-08-15 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754852#comment-17754852
 ] 

GridGain Integration commented on SPARK-44795:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42508

> CodeGenCache should be ClassLoader specific
> ---
>
> Key: SPARK-44795
> URL: https://issues.apache.org/jira/browse/SPARK-44795
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Blocker
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-44653) non-trivial DataFrame unions should not break caching

2023-08-14 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754116#comment-17754116
 ] 

GridGain Integration commented on SPARK-44653:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/42483

> non-trivial DataFrame unions should not break caching
> -
>
> Key: SPARK-44653
> URL: https://issues.apache.org/jira/browse/SPARK-44653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.3, 3.4.2, 3.5.0
>
>







[jira] [Commented] (SPARK-43885) DataSource V2: Handle MERGE commands for delta-based sources

2023-08-14 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753921#comment-17753921
 ] 

GridGain Integration commented on SPARK-43885:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/42482

> DataSource V2: Handle MERGE commands for delta-based sources
> 
>
> Key: SPARK-43885
> URL: https://issues.apache.org/jira/browse/SPARK-43885
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.5.0
>
>
> We should handle MERGE commands for delta-based sources, just like DELETE and 
> UPDATE.






[jira] [Commented] (SPARK-44447) Use PartitionEvaluator API in FlatMapGroupsInPandasExec, FlatMapCoGroupsInPandasExec

2023-08-12 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753564#comment-17753564
 ] 

GridGain Integration commented on SPARK-44447:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/42025

> Use PartitionEvaluator API in FlatMapGroupsInPandasExec, 
> FlatMapCoGroupsInPandasExec
> 
>
> Key: SPARK-44447
> URL: https://issues.apache.org/jira/browse/SPARK-44447
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Priority: Major
>
> Use PartitionEvaluator API in
> `FlatMapGroupsInPandasExec`
> `FlatMapCoGroupsInPandasExec`






[jira] [Commented] (SPARK-44305) Broadcast operation is not required when no parameters are specified

2023-08-12 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753565#comment-17753565
 ] 

GridGain Integration commented on SPARK-44305:
--

User '7mming7' has created a pull request for this issue:
https://github.com/apache/spark/pull/42037

> Broadcast operation is not required when no parameters are specified
> 
>
> Key: SPARK-44305
> URL: https://issues.apache.org/jira/browse/SPARK-44305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: 7mming7
>Priority: Minor
> Attachments: image-2023-07-05-11-51-41-708.png
>
>
> SPARK-14912 introduced the ability to broadcast data source parameters to 
> read and write operations. However, even when the user does not specify any 
> parameters, the broadcast is still performed, which has a significant 
> performance impact. We should avoid broadcasting the full Hadoop 
> configuration when the user does not specify any parameters.
>  
> !image-2023-07-05-11-51-41-708.png!






[jira] [Commented] (SPARK-44407) Prohibit using `enum` as a variable or function name

2023-08-12 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753563#comment-17753563
 ] 

GridGain Integration commented on SPARK-44407:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41982

> Prohibit using `enum` as a variable or function name
> 
>
> Key: SPARK-44407
> URL: https://issues.apache.org/jira/browse/SPARK-44407
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
> [warn] 
> /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/JavaTypeInferenceSuite.scala:74:21:
>  [deprecation @  | origin= | version=2.13.7] Wrap `enum` in backticks to use 
> it as an identifier, it will become a keyword in Scala 3.
> [warn]   @BeanProperty var enum: java.time.Month = _ {code}
> enum will become a keyword in Scala 3.






[jira] [Commented] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-08-10 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752785#comment-17752785
 ] 

GridGain Integration commented on SPARK-44756:
--

User 'hdaikoku' has created a pull request for this issue:
https://github.com/apache/spark/pull/42426

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> ---
>
> Key: SPARK-44756
> URL: https://issues.apache.org/jira/browse/SPARK-44756
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.1
>Reporter: Harunobu Daikoku
>Priority: Minor
>
> We have observed this issue several times in our production environment, 
> where some executors get stuck at BlockTransferService#fetchBlockSync().
> After some investigation, the issue seems to be caused by an unhandled edge 
> case in RetryingBlockTransferor.
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>   at 
> org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>   at 
> org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception caught by 
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381],
>  and propagated to the exception handler
> 3. Exception reaches 
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
>  and it tries to initiate retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying 
> fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. Exception caught by 
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309],
>  and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An 
> exception java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:719)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>   at 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>   at 
> org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
>   at 
> 
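The edge case described above can be sketched in isolation (a hedged, simplified model, not the actual `RetryingBlockTransferor` code): if submitting the retry task itself throws, such as the `OutOfMemoryError` from thread creation in step 4, the failure must still reach the transfer listener, otherwise the caller blocked in a synchronous fetch waits forever. The sketch below simulates the scheduling failure with a shut-down executor, which throws `RejectedExecutionException` instead of an OOM.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

public class RetryGuardDemo {
    public interface TransferListener {
        void onFailure(Throwable t);
    }

    // Guarded retry initiation: if scheduling the retry itself fails,
    // surface the error to the listener rather than letting it vanish
    // in the Netty exception handler and hang the waiting caller.
    public static void initiateRetry(ExecutorService executor,
                                     Runnable retryTask,
                                     TransferListener listener) {
        try {
            executor.submit(retryTask);
        } catch (Throwable t) { // e.g. RejectedExecutionException, OutOfMemoryError
            listener.onFailure(t);
        }
    }

    public static void main(String[] args) {
        ExecutorService dead = Executors.newSingleThreadExecutor();
        dead.shutdown(); // simulates "cannot start the retry thread"
        AtomicReference<Throwable> seen = new AtomicReference<>();
        initiateRetry(dead, () -> {}, seen::set);
        // The listener observed the scheduling failure instead of hanging.
        System.out.println(seen.get().getClass().getSimpleName());
    }
}
```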

[jira] [Commented] (SPARK-43606) Enable IndexesTests.test_index_basic for pandas 2.0.0.

2023-08-01 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17749848#comment-17749848
 ] 

GridGain Integration commented on SPARK-43606:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/42267

> Enable IndexesTests.test_index_basic for pandas 2.0.0.
> --
>
> Key: SPARK-43606
> URL: https://issues.apache.org/jira/browse/SPARK-43606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable IndexesTests.test_index_basic for pandas 2.0.0.






[jira] [Commented] (SPARK-44098) Introduce python breaking change detection

2023-07-26 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747396#comment-17747396
 ] 

GridGain Integration commented on SPARK-44098:
--

User 'StardustDL' has created a pull request for this issue:
https://github.com/apache/spark/pull/42125

> Introduce python breaking change detection
> --
>
> Key: SPARK-44098
> URL: https://issues.apache.org/jira/browse/SPARK-44098
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> We have breaking change detection for binary compatibility and Protobufs, 
> but we don't have one for Python.
> The authors of [aexpy|https://github.com/StardustDL/aexpy] are willing to 
> help PySpark detect Python breaking changes.






[jira] [Commented] (SPARK-44509) Fine grained interrupt in Python Spark Connect

2023-07-24 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746270#comment-17746270
 ] 

GridGain Integration commented on SPARK-44509:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42120

> Fine grained interrupt in Python Spark Connect
> --
>
> Key: SPARK-44509
> URL: https://issues.apache.org/jira/browse/SPARK-44509
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Same as https://issues.apache.org/jira/browse/SPARK-44422, but needed for 
> Python.






[jira] [Commented] (SPARK-44505) DataSource v2 Scans should not require planning the input partitions on explain

2023-07-21 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17745598#comment-17745598
 ] 

GridGain Integration commented on SPARK-44505:
--

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/42099

> DataSource v2 Scans should not require planning the input partitions on 
> explain
> ---
>
> Key: SPARK-44505
> URL: https://issues.apache.org/jira/browse/SPARK-44505
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>
> Right now, we always call `planInputPartitions()` for a DSv2 implementation 
> even when no Spark job is run and the query is only explained.
> We should provide a way to avoid planning all input partitions just to 
> determine whether the input is columnar. The scan should provide an 
> override.
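The underlying fix is a lazy-initialization pattern. The following is a generic sketch (not Spark's actual DSv2 interfaces): the expensive planning step sits behind a memoizing supplier, so explain-style code paths that only need metadata never trigger it, while the first real execution computes and caches it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyPlanning {
    // Wrap an expensive computation so it runs at most once, and only
    // when some caller actually needs its result.
    public static <T> Supplier<T> memoize(Supplier<T> expensive) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = expensive.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Stands in for an expensive planInputPartitions()-like step.
        Supplier<String> partitions = memoize(() -> {
            calls.incrementAndGet();
            return "partitions";
        });
        System.out.println("planned " + calls.get() + " times"); // planned 0 times
        partitions.get();
        partitions.get();
        System.out.println("planned " + calls.get() + " times"); // planned 1 times
    }
}
```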






[jira] [Commented] (SPARK-43611) Fix unexpected `AnalysisException` from Spark Connect client

2023-07-17 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743734#comment-17743734
 ] 

GridGain Integration commented on SPARK-43611:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42040

> Fix unexpected `AnalysisException` from Spark Connect client
> 
>
> Key: SPARK-43611
> URL: https://issues.apache.org/jira/browse/SPARK-43611
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Reproducible example:
> {code:java}
> >>> import pyspark.pandas as ps
> >>> psdf1 = ps.DataFrame({"A": [1, 2, 3]})
> >>> psdf2 = ps.DataFrame({"B": [1, 2, 3]})
> >>> psdf1.append(psdf2)
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py:8897:
>  FutureWarning: The DataFrame.append method is deprecated and will be removed 
> in a future version. Use pyspark.pandas.concat instead.
>   warnings.warn(
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py", 
> line 8930, in append
>     return cast(DataFrame, concat([self, other], ignore_index=ignore_index))
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/namespace.py",
>  line 2703, in concat
>     psdfs[0]._internal.copy(
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py",
>  line 1508, in copy
>     return InternalFrame(
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py",
>  line 753, in __init__
>     schema = spark_frame.select(data_spark_columns).schema
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/dataframe.py",
>  line 1650, in schema
>     return self._session.client.schema(query)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 777, in schema
>     schema = self._analyze(method="schema", plan=plan).schema
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 958, in _analyze
>     self._handle_error(error)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 1195, in _handle_error
>     self._handle_rpc_error(error)
>   File 
> "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py",
>  line 1231, in _handle_rpc_error
>     raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: When resolving 'A, fail 
> to find subplan with plan_id=16 in 'Project ['A, 'B]
> +- Project [__index_level_0__#1101L, A#1102L, B#1157L, 
> monotonically_increasing_id() AS __natural_order__#1163L]
>    +- Union false, false
>       :- Project [__index_level_0__#1101L, A#1102L, cast(B#1116 as bigint) AS 
> B#1157L]
>       :  +- Project [__index_level_0__#1101L, A#1102L, B#1116]
>       :     +- Project [__index_level_0__#1101L, A#1102L, 
> __natural_order__#1108L, null AS B#1116]
>       :        +- Project [__index_level_0__#1101L, A#1102L, 
> __natural_order__#1108L]
>       :           +- Project [__index_level_0__#1101L, A#1102L, 
> monotonically_increasing_id() AS __natural_order__#1108L]
>       :              +- Project [__index_level_0__#1097L AS 
> __index_level_0__#1101L, A#1098L AS A#1102L]
>       :                 +- LocalRelation [__index_level_0__#1097L, A#1098L]
>       +- Project [__index_level_0__#1137L, cast(A#1152 as bigint) AS A#1158L, 
> B#1138L]
>          +- Project [__index_level_0__#1137L, A#1152, B#1138L]
>             +- Project [__index_level_0__#1137L, B#1138L, 
> __natural_order__#1144L, null AS A#1152]
>                +- Project [__index_level_0__#1137L, B#1138L, 
> __natural_order__#1144L]
>                   +- Project [__index_level_0__#1137L, B#1138L, 
> monotonically_increasing_id() AS __natural_order__#1144L]
>                      +- Project [__index_level_0__#1133L AS 
> __index_level_0__#1137L, B#1134L AS B#1138L]
>                         +- LocalRelation [__index_level_0__#1133L, B#1134L] 
> {code}
> Another example:
> {code:java}
> >>> pdf = pd.DataFrame(
> ...     {
> ...         "A": [None, 3, None, None],
> ...         "B": [2, 4, None, 3],
> ...         "C": [None, None, None, 1],
> ...         "D": [0, 1, 5, 4],
> ...     },
> ...     columns=["A", "B", "C", "D"],
> ... )
> >>> psdf = ps.from_pandas(pdf)
> >>> psdf.backfill()
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/expressions.py:945:
>  UserWarning: WARN WindowExpression: No Partition Defined for Window 
> operation! Moving all data to a single partition, this 

[jira] [Commented] (SPARK-44406) DataFrame depending on temp view fail after the view is dropped

2023-07-13 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742827#comment-17742827
 ] 

GridGain Integration commented on SPARK-44406:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/41986

> DataFrame depending on temp view fail after the view is dropped
> ---
>
> Key: SPARK-44406
> URL: https://issues.apache.org/jira/browse/SPARK-44406
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> In vanilla Spark:
> {code:java}
> In [1]: df = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
> In [2]: df.createOrReplaceTempView("t")
> In [3]: df2 = spark.sql("select * from t")
> In [4]: df2.show()
> +---+---+ 
>   
> |  A|  B|
> +---+---+
> |  1|  4|
> |  2|  4|
> |  3|  6|
> +---+---+
> In [5]: spark.catalog.dropTempView("t")
> Out[5]: True
> In [6]: df2.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  1|  4|
> |  2|  4|
> |  3|  6|
> +---+---+
> {code}
> In Spark Connect:
> {code:java}
> In [1]: df = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
> In [2]: df.createOrReplaceTempView("t")
> In [3]: df2 = spark.sql("select * from t")
> In [4]: df2.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  1|  4|
> |  2|  4|
> |  3|  6|
> +---+---+
> In [5]: spark.catalog.dropTempView("t")
> Out[5]: True
> In [6]: df2.show()
> 23/07/13 11:57:18 ERROR SparkConnectService: Error during: execute. UserId: 
> ruifeng.zheng. SessionId: 1fc234fd-07da-4ad0-9ec5-2d818cef6033.
> org.apache.spark.sql.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table 
> or view `t` cannot be found. Verify the spelling and correctness of the 
> schema and catalog.
> If you did not qualify the name with a schema, verify the current_schema() 
> output, or qualify the name with the correct schema and catalog.
> To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF 
> EXISTS.; line 1 pos 14;
> 'Project [*]
> +- 'UnresolvedRelation [t], [], false
> {code}






[jira] [Commented] (SPARK-43974) Upgrade buf to v1.23.1

2023-07-12 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742352#comment-17742352
 ] 

GridGain Integration commented on SPARK-43974:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41937

> Upgrade buf to v1.23.1
> --
>
> Key: SPARK-43974
> URL: https://issues.apache.org/jira/browse/SPARK-43974
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-44263) Allow ChannelBuilder extensions -- Scala

2023-07-07 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740973#comment-17740973
 ] 

GridGain Integration commented on SPARK-44263:
--

User 'cdkrot' has created a pull request for this issue:
https://github.com/apache/spark/pull/41880

> Allow ChannelBuilder extensions -- Scala
> 
>
> Key: SPARK-44263
> URL: https://issues.apache.org/jira/browse/SPARK-44263
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Alice Sayutina
>Priority: Major
>
> Follow up to https://issues.apache.org/jira/browse/SPARK-43332
> Provide similar extension capabilities in Scala






[jira] [Commented] (SPARK-43801) Support unwrap date type to string type in UnwrapCastInBinaryComparison

2023-06-19 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17734095#comment-17734095
 ] 

GridGain Integration commented on SPARK-43801:
--

User 'puchengy' has created a pull request for this issue:
https://github.com/apache/spark/pull/41332

> Support unwrap date type to string type in UnwrapCastInBinaryComparison
> ---
>
> Key: SPARK-43801
> URL: https://issues.apache.org/jira/browse/SPARK-43801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Pucheng Yang
>Priority: Major
>
> Similar to https://issues.apache.org/jira/browse/SPARK-42597 and others, add 
> support to 
> UnwrapCastInBinaryComparison such that it can unwrap date type to string type.






[jira] [Commented] (SPARK-44065) Optimize BroadcastHashJoin skew when localShuffleReader is disabled

2023-06-15 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733300#comment-17733300
 ] 

GridGain Integration commented on SPARK-44065:
--

User 'wForget' has created a pull request for this issue:
https://github.com/apache/spark/pull/41609

> Optimize BroadcastHashJoin skew when localShuffleReader is disabled
> ---
>
> Key: SPARK-44065
> URL: https://issues.apache.org/jira/browse/SPARK-44065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Zhen Wang
>Priority: Major
>
> In RemoteShuffleService services such as uniffle and celeborn, it is 
> recommended to disable localShuffleReader by default for better performance. 
> But it may make BroadcastHashJoin skewed, so I want to optimize 
> BroadcastHashJoin skew in OptimizeSkewedJoin when localShuffleReader is 
> disabled.
>  
> Refer to:
> https://github.com/apache/incubator-celeborn#spark-configuration
> https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md#support-spark-aqe






[jira] [Commented] (SPARK-43511) Implemented State APIs for Spark Connect Scala

2023-06-15 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733195#comment-17733195
 ] 

GridGain Integration commented on SPARK-43511:
--

User 'bogao007' has created a pull request for this issue:
https://github.com/apache/spark/pull/41558

> Implemented State APIs for Spark Connect Scala
> --
>
> Key: SPARK-43511
> URL: https://issues.apache.org/jira/browse/SPARK-43511
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Bo Gao
>Priority: Major
>
> Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark 
> Connect Scala






[jira] [Commented] (SPARK-44057) Mark all `local-cluster` tests as `ExtendedSQLTest`

2023-06-14 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732751#comment-17732751
 ] 

GridGain Integration commented on SPARK-44057:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41601

> Mark all `local-cluster` tests as `ExtendedSQLTest`
> ---
>
> Key: SPARK-44057
> URL: https://issues.apache.org/jira/browse/SPARK-44057
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.5.0
>
>
> This issue aims to mark all `local-cluster` tests as `ExtendedSQLTest`
> https://pipelines.actions.githubusercontent.com/serviceHosts/03398d36-4378-4d47-a936-fba0a5e8ccb9/_apis/pipelines/1/runs/251144/signedlogcontent/12?urlExpires=2023-06-14T17%3A11%3A50.2399742Z=HMACV1=%2FHTlrgaHtF2Jv65vw%2Fj4SzT69etebI0swSSM6dXC0tk%3D
> {code}
> $ git grep local-cluster sql/core/
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala:  
>   val session = SparkSession.builder().master("local-cluster[3, 1, 
> 1024]").getOrCreate()
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala://
>  Additional tests run in 'local-cluster' mode.
> sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala:
>   .setMaster("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala:
>   "--master", "local-cluster[1,1,1024]",
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala:
>   .master("local-cluster[2,1,1024]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>  * Tests in this suite we need to run Spark in local-cluster mode. In 
> particular, the use of
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>* Create a new [[SparkSession]] running in local-cluster mode with unsafe 
> and codegen enabled.
> sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala:
>   .master("local-cluster[2,1,512]")
> sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala:
>   .config(sparkConf.setMaster("local-cluster[2, 1, 1024]"))
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   // Create a new [[SparkSession]] running in local-cluster mode.
> sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala:
>   .master("local-cluster[2,1,1024]")
> {code}






[jira] [Commented] (SPARK-43943) Add math functions to Scala and Python

2023-06-02 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728738#comment-17728738
 ] 

GridGain Integration commented on SPARK-43943:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/41435

> Add math functions to Scala and Python
> --
>
> Key: SPARK-43943
> URL: https://issues.apache.org/jira/browse/SPARK-43943
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Add following functions:
> * ceiling
> * e
> * pi
> * ln
> * negative
> * positive
> * power
> * sign
> * std
> * width_bucket
> to:
> * Scala API
> * Python API
> * Spark Connect Scala Client
> * Spark Connect Python Client
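Of the functions listed, most are thin aliases of existing ones; `width_bucket` has the only nontrivial semantics. As a hedged illustration, here is a minimal pure-Python model of the SQL-standard equi-width bucketing behavior (not Spark's implementation):

```python
import math

def width_bucket(value: float, min_val: float, max_val: float, num_buckets: int) -> int:
    """Return the bucket (1..num_buckets) that `value` falls into for an
    equi-width histogram over [min_val, max_val): 0 for underflow and
    num_buckets + 1 for overflow, following the SQL-standard convention."""
    if num_buckets <= 0 or min_val == max_val:
        raise ValueError("invalid bucket specification")
    if value < min_val:
        return 0
    if value >= max_val:
        return num_buckets + 1
    width = (max_val - min_val) / num_buckets
    return int(math.floor((value - min_val) / width)) + 1

print(width_bucket(5.3, 0.2, 10.6, 5))   # -> 3
print(width_bucket(-1.0, 0.0, 10.0, 5))  # -> 0 (underflow bucket)
print(width_bucket(10.0, 0.0, 10.0, 5))  # -> 6 (overflow bucket)
```

The actual Spark builtin also handles reversed bounds (min > max) and NULL inputs, which this sketch omits.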






[jira] [Commented] (SPARK-43075) Change gRPC to grpcio when it is not installed.

2023-06-02 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728681#comment-17728681
 ] 

GridGain Integration commented on SPARK-43075:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/40716

> Change gRPC to grpcio when it is not installed.
> ---
>
> Key: SPARK-43075
> URL: https://issues.apache.org/jira/browse/SPARK-43075
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-43063) `df.show` handle null should print NULL instead of null

2023-06-02 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728682#comment-17728682
 ] 

GridGain Integration commented on SPARK-43063:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41432

> `df.show` handle null should print NULL instead of null
> ---
>
> Key: SPARK-43063
> URL: https://issues.apache.org/jira/browse/SPARK-43063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yikaifei
>Assignee: yikaifei
>Priority: Trivial
> Fix For: 3.5.0
>
>
> `df.show` should print NULL instead of null when handling null values, for 
> consistent behavior;
> {code:java}
> The following behavior is currently inconsistent:
> ``` shell
> scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 
> 'New Jersey', 4, 'Seattle') as result").show(false)
> +--+
> |result|
> +--+
> |null  |
> +--+
> ```
> ``` shell
> spark-sql> DESC FUNCTION EXTENDED decode;
> function_desc
> Function: decode
> Class: org.apache.spark.sql.catalyst.expressions.Decode
> Usage:
> decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> decode(expr, search, result [, search, result ] ... [, default]) - 
> Compares expr
>   to each search value in order. If expr is equal to a search value, 
> decode returns
>   the corresponding result. If no match is found, then it returns 
> default. If default
>   is omitted, it returns null.
> Extended Usage:
> Examples:
>   > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
>abc
>   > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>San Francisco
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>Non domestic
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle');
>NULL
> Since: 3.2.0
> Time taken: 0.074 seconds, Fetched 4 row(s)
> ```
> ``` shell
> spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New 
> Jersey', 4, 'Seattle');
> NULL
> {code}
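The inconsistency above hinges on how an unmatched `decode` renders its missing result. A hedged pure-Python model of `decode(expr, search1, result1, ..., [default])` (illustrative only, not Spark's implementation; SQL NULL-matching subtleties are omitted):

```python
def decode(expr, *args):
    """Model of SQL decode(expr, search1, result1, ..., [default]):
    compare expr to each search value in order and return the paired
    result; fall back to default (or None) when nothing matches."""
    pairs, default = args, None
    if len(args) % 2 == 1:            # a trailing odd argument is the default
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[::2], pairs[1::2]):
        if search == expr:
            return result
    return default

cities = (1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle')
print(decode(2, *cities))                  # San Francisco
print(decode(6, *cities))                  # None -- the case df.show rendered as "null"
print(decode(6, *cities, 'Non domestic'))  # Non domestic
```

The fix is purely in the rendering layer: the unmatched case should be displayed as `NULL`, matching spark-sql and `DESC FUNCTION EXTENDED` output.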






[jira] [Commented] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier

2023-05-30 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727621#comment-17727621
 ] 

GridGain Integration commented on SPARK-43205:
--

User 'srielau' has created a pull request for this issue:
https://github.com/apache/spark/pull/40884

> Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
> ---
>
> Key: SPARK-43205
> URL: https://issues.apache.org/jira/browse/SPARK-43205
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
> Fix For: 3.5.0
>
>
> There is a requirement for SQL templates, where the table and/or column names 
> are provided through substitution. This can be done today using variable 
> substitution:
> SET hivevar:tabname = mytab;
> SELECT * FROM ${ hivevar:tabname };
> A straight variable substitution is dangerous since it does allow for SQL 
> injection:
> SET hivevar:tabname = mytab, someothertab;
> SELECT * FROM ${ hivevar:tabname };
> A way to get around this problem is to wrap the variable substitution with a 
> clause that limits the scope t produce an identifier.
> This approach is taken by Snowflake:
>  
> [https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql]
> SET hivevar:tabname = 'tabname';
> SELECT * FROM IDENTIFIER(${ hivevar:tabname })
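The safety argument can be sketched outside of Spark: wrapping a substitution in an identifier context means the substituted string must parse as a single (possibly qualified) identifier, so an injection payload is rejected rather than spliced into the query. A hedged illustration with a hypothetical validator (simplified: backtick-quoted identifiers are not handled):

```python
import re

# Accept only a bare, possibly dot-qualified identifier; anything else
# (commas, spaces, operators) fails instead of reaching the FROM clause.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*(\.[A-Za-z_][A-Za-z_0-9]*)*\Z")

def identifier(raw: str) -> str:
    if not _IDENT.match(raw):
        raise ValueError(f"not a valid identifier: {raw!r}")
    return raw

print("SELECT * FROM " + identifier("mytab"))     # safe substitution
try:
    identifier("mytab, someothertab")             # injection attempt
except ValueError as e:
    print(e)                                      # rejected, not executed
```

Spark's actual `IDENTIFIER()` clause does this at the parser level rather than with a regex, but the contract is the same: the argument can only ever name one object.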






[jira] [Commented] (SPARK-43171) Support dynamic changing unix user in Pod

2023-05-23 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725489#comment-17725489
 ] 

GridGain Integration commented on SPARK-43171:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/40831

> Support dynamic changing unix user in Pod
> -
>
> Key: SPARK-43171
> URL: https://issues.apache.org/jira/browse/SPARK-43171
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Priority: Major
>







[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics

2023-05-23 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725491#comment-17725491
 ] 

GridGain Integration commented on SPARK-40708:
--

User 'jackylee-ch' has created a pull request for this issue:
https://github.com/apache/spark/pull/40944

> Auto update table statistics based on write metrics
> ---
>
> Key: SPARK-40708
> URL: https://issues.apache.org/jira/browse/SPARK-40708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
>   // Get write statistics
>   def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): 
> Option[WriteStats] = {
> val numBytes = 
> metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_))
> val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_))
> numBytes.map(WriteStats(mode, _, numRows))
>   }
> // Update table statistics
>   val stat = wroteStats.get
>   stat.mode match {
> case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
>   catalog.alterTableStats(table.identifier,
> Some(CatalogStatistics(stat.numBytes, stat.numRows)))
> case _ if table.stats.nonEmpty => // SaveMode.Append
>   catalog.alterTableStats(table.identifier, None)
> case _ => // SaveMode.Ignore Do nothing
> {code}
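The quoted Scala amounts to a small decision table over the save mode. A hedged Python restatement of that logic (hypothetical names, not Spark's API):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WriteStats:
    mode: str                  # "overwrite", "error_if_exists", "append", "ignore"
    num_bytes: int
    num_rows: Optional[int]

def updated_stats(stat: WriteStats, table_has_stats: bool) -> Tuple[bool, Optional[tuple]]:
    """Mirror the proposed decision table: overwrite-style writes replace the
    table statistics with the write metrics; an append invalidates any existing
    stats (they are now stale); ignore does nothing.
    Returns (should_alter, new_stats)."""
    if stat.mode in ("overwrite", "error_if_exists"):
        return True, (stat.num_bytes, stat.num_rows)
    if stat.mode == "append" and table_has_stats:
        return True, None       # clear stale statistics
    return False, None

print(updated_stats(WriteStats("overwrite", 1024, 10), True))  # (True, (1024, 10))
print(updated_stats(WriteStats("append", 512, 5), True))       # (True, None)
print(updated_stats(WriteStats("ignore", 0, None), False))     # (False, None)
```

The interesting design choice is the append case: rather than trying to merge metrics into existing statistics, the proposal simply clears them so the optimizer never plans against stale numbers.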






[jira] [Commented] (SPARK-43264) Avoid allocation of unwritten ColumnVector in VectorizedReader

2023-05-23 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725490#comment-17725490
 ] 

GridGain Integration commented on SPARK-43264:
--

User 'majdyz' has created a pull request for this issue:
https://github.com/apache/spark/pull/40929

> Avoid allocation of unwritten ColumnVector in VectorizedReader
> --
>
> Key: SPARK-43264
> URL: https://issues.apache.org/jira/browse/SPARK-43264
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>
> Spark's vectorized reader allocates an array for every field for each value 
> count, even when the array ends up empty. This causes high memory consumption 
> when reading a table with a large struct+array or many columns with sparse 
> values. One way to fix this is to allocate the column vector lazily, creating 
> the array only when it is needed (i.e., when the array is written).
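The lazy-allocation pattern being proposed can be sketched in a few lines. This is an illustrative Python model only, not the actual `ColumnVector` code:

```python
class LazyArrayColumn:
    """Defer the backing array allocation until the first write, so a
    sparse column that is never written pays (almost) nothing."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = None           # nothing allocated yet

    def write(self, index: int, value):
        if self._data is None:      # allocate only on the first write
            self._data = [None] * self.capacity
        self._data[index] = value

    @property
    def allocated(self) -> bool:
        return self._data is not None

col = LazyArrayColumn(capacity=4096)
print(col.allocated)   # False: an unwritten column holds no backing array
col.write(0, 42)
print(col.allocated)   # True: allocated lazily on first write
```

For a table with many sparse columns, the saving is the difference between allocating every column's array up front and allocating only the columns that actually receive data in a given batch.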






[jira] [Commented] (SPARK-43024) Upgrade pandas to 2.0.0

2023-05-19 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724276#comment-17724276
 ] 

GridGain Integration commented on SPARK-43024:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/41211

> Upgrade pandas to 2.0.0
> ---
>
> Key: SPARK-43024
> URL: https://issues.apache.org/jira/browse/SPARK-43024
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas 2.0.0 was released on Apr 03, 2023.
>  
> We should update our infra and docs to support it.






[jira] [Commented] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4

2023-05-17 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723478#comment-17723478
 ] 

GridGain Integration commented on SPARK-43537:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41198

> Upgrade the asm deps in the tools module to 9.4
> ---
>
> Key: SPARK-43537
> URL: https://issues.apache.org/jira/browse/SPARK-43537
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Commented] (SPARK-43206) Connect Better StreamingQueryException

2023-05-01 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718343#comment-17718343
 ] 

GridGain Integration commented on SPARK-43206:
--

User 'WweiL' has created a pull request for this issue:
https://github.com/apache/spark/pull/40966

> Connect Better StreamingQueryException
> --
>
> Key: SPARK-43206
> URL: https://issues.apache.org/jira/browse/SPARK-43206
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.0
>
>
> [https://github.com/apache/spark/pull/40785#issuecomment-1515522281]
>  
>  






[jira] [Commented] (SPARK-43263) Upgrade FasterXML jackson to 2.15.0

2023-04-26 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716634#comment-17716634
 ] 

GridGain Integration commented on SPARK-43263:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/40933

> Upgrade FasterXML jackson to 2.15.0
> ---
>
> Key: SPARK-43263
> URL: https://issues.apache.org/jira/browse/SPARK-43263
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> * #390: (yaml) Upgrade to Snakeyaml 2.0 (resolves 
> [CVE-2022-1471|https://nvd.nist.gov/vuln/detail/CVE-2022-1471])
>  (contributed by @pjfannin






[jira] [Commented] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2

2023-04-20 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714647#comment-17714647
 ] 

GridGain Integration commented on SPARK-43197:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/40860

> Clean up the code written for compatibility with Hadoop 2
> -
>
> Key: SPARK-43197
> URL: https://issues.apache.org/jira/browse/SPARK-43197
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL, YARN
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> SPARK-42452 removed support for Hadoop 2; we can now clean up the code written 
> for compatibility with Hadoop 2 to make it more concise.






[jira] [Commented] (SPARK-43215) Remove `ResourceRequestHelper#isYarnResourceTypesAvailable`

2023-04-20 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714646#comment-17714646
 ] 

GridGain Integration commented on SPARK-43215:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40876

> Remove `ResourceRequestHelper#isYarnResourceTypesAvailable`
> ---
>
> Key: SPARK-43215
> URL: https://issues.apache.org/jira/browse/SPARK-43215
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>







[jira] [Commented] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts

2023-04-18 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713583#comment-17713583
 ] 

GridGain Integration commented on SPARK-42657:
--

User 'vicennial' has created a pull request for this issue:
https://github.com/apache/spark/pull/40675

> Support to find and transfer client-side REPL classfiles to server as 
> artifacts  
> -
>
> Key: SPARK-42657
> URL: https://issues.apache.org/jira/browse/SPARK-42657
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0
>
>
> To run UDFs defined in the client-side REPL, we require a mechanism that can 
> find the local REPL classfiles and then use the mechanism from 
> https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the 
> server as artifacts.






[jira] [Commented] (SPARK-43146) Implement eager evaluation.

2023-04-17 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713265#comment-17713265
 ] 

GridGain Integration commented on SPARK-43146:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40800

> Implement eager evaluation.
> ---
>
> Key: SPARK-43146
> URL: https://issues.apache.org/jira/browse/SPARK-43146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Commented] (SPARK-43099) `Class.getCanonicalName` return null for anonymous class on JDK15+, impacting function registry

2023-04-16 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712855#comment-17712855
 ] 

GridGain Integration commented on SPARK-43099:
--

User 'alexjinghn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40747

> `Class.getCanonicalName` return null for anonymous class on JDK15+, impacting 
> function registry
> ---
>
> Key: SPARK-43099
> URL: https://issues.apache.org/jira/browse/SPARK-43099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Alex Jing
>Priority: Major
>
> On JDK15+, lambdas and method references are implemented using hidden classes 
> ([https://openjdk.org/jeps/371]). According to the JEP,
> {quote}{{Class::getCanonicalName}} returns {{{}null{}}}, indicating the 
> hidden class has no canonical name. (Note that the {{Class}} object for an 
> anonymous class in the Java language has the same behavior.)
> {quote}
> This means 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L53]
>  will always be null.
>  
> This can be fixed by replacing `getCanonicalName` with `getName`
>  






[jira] [Commented] (SPARK-43105) Abbreviate Bytes in proto message's debug string

2023-04-12 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711311#comment-17711311
 ] 

GridGain Integration commented on SPARK-43105:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40750

> Abbreviate Bytes in proto message's debug string
> 
>
> Key: SPARK-43105
> URL: https://issues.apache.org/jira/browse/SPARK-43105
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-43063) `df.show` handle null should print NULL instead of null

2023-04-10 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710133#comment-17710133
 ] 

GridGain Integration commented on SPARK-43063:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/40699

> `df.show` handle null should print NULL instead of null
> ---
>
> Key: SPARK-43063
> URL: https://issues.apache.org/jira/browse/SPARK-43063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yikaifei
>Priority: Trivial
>
> `df.show` should print NULL instead of null when handling null values, for 
> consistent behavior;
> {code:java}
> The following behavior is currently inconsistent:
> ``` shell
> scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 
> 'New Jersey', 4, 'Seattle') as result").show(false)
> +--+
> |result|
> +--+
> |null  |
> +--+
> ```
> ``` shell
> spark-sql> DESC FUNCTION EXTENDED decode;
> function_desc
> Function: decode
> Class: org.apache.spark.sql.catalyst.expressions.Decode
> Usage:
> decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> decode(expr, search, result [, search, result ] ... [, default]) - 
> Compares expr
>   to each search value in order. If expr is equal to a search value, 
> decode returns
>   the corresponding result. If no match is found, then it returns 
> default. If default
>   is omitted, it returns null.
> Extended Usage:
> Examples:
>   > SELECT decode(encode('abc', 'utf-8'), 'utf-8');
>abc
>   > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>San Francisco
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle', 'Non domestic');
>Non domestic
>   > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 
> 4, 'Seattle');
>NULL
> Since: 3.2.0
> Time taken: 0.074 seconds, Fetched 4 row(s)
> ```
> ``` shell
> spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New 
> Jersey', 4, 'Seattle');
> NULL
> {code}






[jira] [Commented] (SPARK-43076) Removing the dependency on `grpcio` when remote session is not used.

2023-04-10 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710083#comment-17710083
 ] 

GridGain Integration commented on SPARK-43076:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40722

> Removing the dependency on `grpcio` when remote session is not used.
> 
>
> Key: SPARK-43076
> URL: https://issues.apache.org/jira/browse/SPARK-43076
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should not require installing `grpcio` when a remote session is not used 
> for the pandas API on Spark.


