[jira] [Commented] (SPARK-49000) Aggregation with DISTINCT gives wrong results when dealing with literals
[ https://issues.apache.org/jira/browse/SPARK-49000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17869348#comment-17869348 ]

GridGain Integration commented on SPARK-49000:
----------------------------------------------

User 'uros-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/47482

> Aggregation with DISTINCT gives wrong results when dealing with literals
> ------------------------------------------------------------------------
>
>                 Key: SPARK-49000
>                 URL: https://issues.apache.org/jira/browse/SPARK-49000
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Uroš Bojanić
>            Priority: Critical
>              Labels: pull-request-available
>
> Aggregation with *DISTINCT* gives wrong results when dealing with literals.
> It appears that this bug affects all (or most) released versions of Spark.
>
> For example:
> {code:java}
> select count(distinct 1) from t{code}
> returns 1, while the correct result should be 0.
>
> For reference:
> {code:java}
> select count(1) from t{code}
> returns 0, which is the correct and expected result.
>
> In these examples, suppose that *t* is an empty table (with any columns).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
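The expected semantics in the report above can be sketched without Spark: SQL `COUNT` evaluates its argument once per row, so over zero rows neither `COUNT(1)` nor `COUNT(DISTINCT 1)` has anything to count. A minimal pure-Python model (`count_literal` is an illustrative helper, not Spark code):

```python
# Pure-Python model of SQL COUNT semantics over a table's rows.
# COUNT(expr) counts non-NULL evaluations of expr per row; with zero
# rows there is nothing to evaluate, so DISTINCT cannot change the result.

def count_literal(rows, distinct=False):
    """COUNT(1) / COUNT(DISTINCT 1) over `rows` (a list of row tuples)."""
    values = [1 for _ in rows]          # the literal, evaluated once per row
    if distinct:
        values = set(values)
    return len(values)

empty_t = []
nonempty_t = [("a",), ("b",)]

print(count_literal(empty_t))                    # 0
print(count_literal(empty_t, distinct=True))     # 0  (the buggy plan returned 1)
print(count_literal(nonempty_t, distinct=True))  # 1
```

The bug in SPARK-49000 amounts to the `DISTINCT` path behaving as if the literal had been evaluated at least once even when the input is empty.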
[jira] [Commented] (SPARK-47927) Nullability after join not respected in UDF
[ https://issues.apache.org/jira/browse/SPARK-47927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859940#comment-17859940 ]

GridGain Integration commented on SPARK-47927:
----------------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/47081

> Nullability after join not respected in UDF
> -------------------------------------------
>
>                 Key: SPARK-47927
>                 URL: https://issues.apache.org/jira/browse/SPARK-47927
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.1, 3.4.3
>            Reporter: Emil Ejbyfeldt
>            Assignee: Emil Ejbyfeldt
>            Priority: Major
>              Labels: correctness, pull-request-available
>             Fix For: 4.0.0, 3.5.2, 3.4.4
>
> {code:java}
> val ds1 = Seq(1).toDS()
> val ds2 = Seq[Int]().toDS()
> val f = udf[(Int, Option[Int]), (Int, Option[Int])](identity)
> ds1.join(ds2, ds1("value") === ds2("value"), "outer").select(f(struct(ds1("value"), ds2("value")))).show()
> ds1.join(ds2, ds1("value") === ds2("value"), "outer").select(struct(ds1("value"), ds2("value"))).show() {code}
> outputs
> {code:java}
> +---------------------------------------+
> |UDF(struct(value, value, value, value))|
> +---------------------------------------+
> |                                 {1, 0}|
> +---------------------------------------+
>
> +--------------------+
> |struct(value, value)|
> +--------------------+
> |           {1, NULL}|
> +--------------------+ {code}
> So when the result is passed to the UDF, the nullability after the join is
> not respected and we incorrectly end up with a 0 value instead of a null/None
> value.
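The failure mode above can be modeled without Spark: an outer join introduces NULLs on the unmatched side, so a field that was non-nullable before the join must be decoded as nullable afterwards. A pure-Python sketch of the bug (the `outer_join`/`decode` helpers are illustrative, not Spark's encoder machinery):

```python
# Sketch of the SPARK-47927 correctness issue: decoding an outer-join
# result with the pre-join (non-nullable) schema has no representation
# for NULL and silently substitutes the type's default value.

def outer_join(left, right):
    """Full outer join on a single column; None marks the unmatched side."""
    out = [(l, l if l in right else None) for l in left]
    out += [(None, r) for r in right if r not in left]
    return out

def decode(row, nullable):
    # A non-nullable Int decoder cannot represent NULL and falls back
    # to the default value 0 -- the behaviour reported in the issue.
    return tuple(v if (nullable or v is not None) else 0 for v in row)

rows = outer_join([1], [])   # ds2 is empty, so the right side is NULL
print([decode(r, nullable=True) for r in rows])    # [(1, None)]  correct
print([decode(r, nullable=False) for r in rows])   # [(1, 0)]     the bug
```

The fix direction is to widen the schema seen by the UDF's deserializer to the post-join nullability rather than the nullability of the original datasets.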
[jira] [Commented] (SPARK-45982) re-org R package installations
[ https://issues.apache.org/jira/browse/SPARK-45982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787878#comment-17787878 ]

GridGain Integration commented on SPARK-45982:
----------------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/43904

> re-org R package installations
> ------------------------------
>
>                 Key: SPARK-45982
>                 URL: https://issues.apache.org/jira/browse/SPARK-45982
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Project Infra
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
[jira] [Commented] (SPARK-45392) Replace `Class.newInstance()` with `Class.getDeclaredConstructor().newInstance()`
[ https://issues.apache.org/jira/browse/SPARK-45392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770868#comment-17770868 ]

GridGain Integration commented on SPARK-45392:
----------------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/43193

> Replace `Class.newInstance()` with `Class.getDeclaredConstructor().newInstance()`
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-45392
>                 URL: https://issues.apache.org/jira/browse/SPARK-45392
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, SQL
>    Affects Versions: 4.0.0
>            Reporter: Yang Jie
>            Priority: Minor
>
[jira] [Commented] (SPARK-45014) Clean up fileserver when cleaning up files, jars and archives in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-45014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760658#comment-17760658 ]

GridGain Integration commented on SPARK-45014:
----------------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42731

> Clean up fileserver when cleaning up files, jars and archives in SparkContext
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-45014
>                 URL: https://issues.apache.org/jira/browse/SPARK-45014
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> In SPARK-44348, we clean up SparkContext's added files but we don't clean up
> the ones in the fileserver.
[jira] [Commented] (SPARK-44931) Fix JSON Serialization for Spark Connect Event Listener
[ https://issues.apache.org/jira/browse/SPARK-44931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758107#comment-17758107 ]

GridGain Integration commented on SPARK-44931:
----------------------------------------------

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/42630

> Fix JSON Serialization for Spark Connect Event Listener
> -------------------------------------------------------
>
>                 Key: SPARK-44931
>                 URL: https://issues.apache.org/jira/browse/SPARK-44931
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: Martin Grund
>            Priority: Major
>
[jira] [Commented] (SPARK-44906) Move substituteAppNExecIds logic into kubernetesConf.annotations method
[ https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757238#comment-17757238 ]

GridGain Integration commented on SPARK-44906:
----------------------------------------------

User 'zwangsheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42600

> Move substituteAppNExecIds logic into kubernetesConf.annotations method
> -----------------------------------------------------------------------
>
>                 Key: SPARK-44906
>                 URL: https://issues.apache.org/jira/browse/SPARK-44906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.4.1
>            Reporter: Binjie Yang
>            Priority: Major
>
> Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations
> as the default behavior, so that users can reuse it instead of
> re-implementing the same logic themselves.
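The refactor proposed above can be sketched in a few lines: fold the placeholder substitution into the annotations accessor itself, so every caller gets substituted values for free. This is a hypothetical pure-Python sketch; the placeholder names (`{{APP_ID}}`, `{{EXECUTOR_ID}}`) follow Spark's Kubernetes convention, but the class shape here is illustrative only:

```python
# Sketch: substitution performed once, inside the accessor, as the default
# behaviour -- callers no longer need to call a separate substitution util.

class KubernetesConf:
    def __init__(self, app_id, executor_id, raw_annotations):
        self.app_id = app_id
        self.executor_id = executor_id
        self._raw = raw_annotations

    def annotations(self):
        # Every caller sees already-substituted annotation values.
        return {
            key: value.replace("{{APP_ID}}", self.app_id)
                      .replace("{{EXECUTOR_ID}}", self.executor_id)
            for key, value in self._raw.items()
        }

conf = KubernetesConf("spark-app-1", "7",
                      {"owner": "{{APP_ID}}-exec-{{EXECUTOR_ID}}"})
print(conf.annotations())   # {'owner': 'spark-app-1-exec-7'}
```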
[jira] [Commented] (SPARK-44795) CodeGenCache should be ClassLoader specific
[ https://issues.apache.org/jira/browse/SPARK-44795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754852#comment-17754852 ]

GridGain Integration commented on SPARK-44795:
----------------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42508

> CodeGenCache should be ClassLoader specific
> -------------------------------------------
>
>                 Key: SPARK-44795
>                 URL: https://issues.apache.org/jira/browse/SPARK-44795
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect, SQL
>    Affects Versions: 3.5.0
>            Reporter: Herman van Hövell
>            Assignee: Herman van Hövell
>            Priority: Blocker
>             Fix For: 3.5.0
>
[jira] [Commented] (SPARK-44653) non-trivial DataFrame unions should not break caching
[ https://issues.apache.org/jira/browse/SPARK-44653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754116#comment-17754116 ]

GridGain Integration commented on SPARK-44653:
----------------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/42483

> non-trivial DataFrame unions should not break caching
> -----------------------------------------------------
>
>                 Key: SPARK-44653
>                 URL: https://issues.apache.org/jira/browse/SPARK-44653
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.3.3, 3.4.2, 3.5.0
>
[jira] [Commented] (SPARK-43885) DataSource V2: Handle MERGE commands for delta-based sources
[ https://issues.apache.org/jira/browse/SPARK-43885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753921#comment-17753921 ]

GridGain Integration commented on SPARK-43885:
----------------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/42482

> DataSource V2: Handle MERGE commands for delta-based sources
> ------------------------------------------------------------
>
>                 Key: SPARK-43885
>                 URL: https://issues.apache.org/jira/browse/SPARK-43885
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Anton Okolnychyi
>            Assignee: Anton Okolnychyi
>            Priority: Major
>             Fix For: 3.5.0
>
> We should handle MERGE commands for delta-based sources, just like DELETE and
> UPDATE.
[jira] [Commented] (SPARK-44447) Use PartitionEvaluator API in FlatMapGroupsInPandasExec, FlatMapCoGroupsInPandasExec
[ https://issues.apache.org/jira/browse/SPARK-44447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753564#comment-17753564 ]

GridGain Integration commented on SPARK-44447:
----------------------------------------------

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/42025

> Use PartitionEvaluator API in FlatMapGroupsInPandasExec, FlatMapCoGroupsInPandasExec
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-44447
>                 URL: https://issues.apache.org/jira/browse/SPARK-44447
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Vinod KC
>            Priority: Major
>
> Use PartitionEvaluator API in
> `FlatMapGroupsInPandasExec`
> `FlatMapCoGroupsInPandasExec`
[jira] [Commented] (SPARK-44305) Broadcast operation is not required when no parameters are specified
[ https://issues.apache.org/jira/browse/SPARK-44305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753565#comment-17753565 ]

GridGain Integration commented on SPARK-44305:
----------------------------------------------

User '7mming7' has created a pull request for this issue:
https://github.com/apache/spark/pull/42037

> Broadcast operation is not required when no parameters are specified
> --------------------------------------------------------------------
>
>                 Key: SPARK-44305
>                 URL: https://issues.apache.org/jira/browse/SPARK-44305
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: 7mming7
>            Priority: Minor
>         Attachments: image-2023-07-05-11-51-41-708.png
>
> SPARK-14912 introduced the ability to broadcast data source parameters to
> read and write operations. However, the broadcast is performed even when the
> user does not specify any parameters, which has a significant performance
> impact. We should avoid broadcasting the full Hadoop parameters when the
> user has not specified any.
>
> !image-2023-07-05-11-51-41-708.png!
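The guard proposed above is small: only broadcast datasource options when the user actually supplied some, and otherwise let executors fall back to the already-shared Hadoop configuration. A hypothetical pure-Python sketch (`prepare_read` and the `broadcast` callable are illustrative names, not Spark's API):

```python
# Sketch: skip the broadcast entirely when there are no user-specified
# options, so explain/read paths do not ship the full Hadoop conf.

def prepare_read(user_options, broadcast):
    """`broadcast` stands in for SparkContext.broadcast; it is only
    invoked when there is something worth shipping to executors."""
    if not user_options:
        return None          # reuse the existing shared Hadoop conf
    return broadcast(user_options)

calls = []
def fake_broadcast(opts):
    calls.append(opts)       # record that a broadcast actually happened
    return opts

prepare_read({}, fake_broadcast)                       # no broadcast
prepare_read({"compression": "snappy"}, fake_broadcast)
print(calls)   # [{'compression': 'snappy'}] -- exactly one broadcast
```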
[jira] [Commented] (SPARK-44407) Prohibit using `enum` as a variable or function name
[ https://issues.apache.org/jira/browse/SPARK-44407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753563#comment-17753563 ]

GridGain Integration commented on SPARK-44407:
----------------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/41982

> Prohibit using `enum` as a variable or function name
> ----------------------------------------------------
>
>                 Key: SPARK-44407
>                 URL: https://issues.apache.org/jira/browse/SPARK-44407
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.5.0
>            Reporter: Yang Jie
>            Priority: Minor
>
> {code:java}
> [warn] /Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/JavaTypeInferenceSuite.scala:74:21: [deprecation @ | origin= | version=2.13.7] Wrap `enum` in backticks to use it as an identifier, it will become a keyword in Scala 3.
> [warn] @BeanProperty var enum: java.time.Month = _ {code}
> `enum` will become a keyword in Scala 3.
[jira] [Commented] (SPARK-44756) Executor hangs when RetryingBlockTransferor fails to initiate retry
[ https://issues.apache.org/jira/browse/SPARK-44756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752785#comment-17752785 ]

GridGain Integration commented on SPARK-44756:
----------------------------------------------

User 'hdaikoku' has created a pull request for this issue:
https://github.com/apache/spark/pull/42426

> Executor hangs when RetryingBlockTransferor fails to initiate retry
> -------------------------------------------------------------------
>
>                 Key: SPARK-44756
>                 URL: https://issues.apache.org/jira/browse/SPARK-44756
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.3.1
>            Reporter: Harunobu Daikoku
>            Priority: Minor
>
> We have been observing this issue several times in our production where some
> executors are stuck at BlockTransferService#fetchBlockSync().
> After some investigation, the issue seems to be caused by an unhandled edge
> case in RetryingBlockTransferor.
>
> 1. Shuffle transfer fails for whatever reason
> {noformat}
> java.io.IOException: Cannot allocate memory
>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>     at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>     at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>     at org.apache.spark.network.shuffle.SimpleDownloadFile$SimpleDownloadWritableChannel.write(SimpleDownloadFile.java:78)
>     at org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onData(OneForOneBlockFetcher.java:340)
>     at org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:79)
>     at org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:263)
>     at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:87)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
> {noformat}
> 2. The above exception is caught by
> [AbstractChannelHandlerContext#invokeChannelRead()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L381]
> and propagated to the exception handler
> 3. The exception reaches
> [RetryingBlockTransferor#initiateRetry()|https://github.com/apache/spark/blob/v3.3.1/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java#L178-L180],
> which tries to initiate a retry
> {noformat}
> 23/08/09 16:58:37 shuffle-client-4-2 INFO RetryingBlockTransferor: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
> {noformat}
> 4. Retry initiation fails (in our case, it fails to create a new thread)
> 5. The exception is caught by
> [AbstractChannelHandlerContext#invokeExceptionCaught()|https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L305-L309]
> and not further processed
> {noformat}
> 23/08/09 16:58:53 shuffle-client-4-2 DEBUG AbstractChannelHandlerContext: An exception java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:719)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>     at org.apache.spark.network.shuffle.RetryingBlockTransferor.initiateRetry(RetryingBlockTransferor.java:182)
>     at org.apache.spark.network.shuffle.RetryingBlockTransferor.access$500(RetryingBlockTransferor.java:43)
>     at org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.handleBlockTransferFailure(RetryingBlockTransferor.java:230)
>     at org.apache.spark.network.shuffle.RetryingBlockTransferor$RetryingBlockTransferListener.onBlockFetchFailure(RetryingBlockTransferor.java:260)
>     at org.apache.spark.network.shuffle.OneForOneBlockFetcher.failRemainingBlocks(OneForOneBlockFetcher.java:318)
>     at org.apache.spark.network.shuffle.OneForOneBlockFetcher.access$300(OneForOneBlockFetcher.java:55)
>     at org.apache.spark.network.shuffle.OneForOneBlockFetcher$DownloadCallback.onFailure(OneForOneBlockFetcher.java:357)
>     at org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:56)
>     at org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:231)
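The five steps above describe a retry path where the act of scheduling the retry can itself fail, leaving nobody to complete the waiting fetch. A pure-Python sketch of the fix direction, under the assumption (not confirmed by the report) that the remedy is to fail the outstanding blocks when retry submission throws; class and method names are illustrative:

```python
# Sketch: if scheduling the retry itself throws (e.g. the thread pool
# cannot create a thread), fail the outstanding block instead of
# swallowing the error -- otherwise fetchBlockSync() waits forever.

class RetryingTransferor:
    def __init__(self, submit_retry, on_failure):
        self.submit_retry = submit_retry   # may itself raise
        self.on_failure = on_failure       # completes the waiting caller

    def handle_transfer_failure(self, block_id, error):
        try:
            self.submit_retry(block_id)
        except Exception as retry_error:
            # The unhandled edge case: without this branch the waiting
            # fetcher is never notified and the executor hangs.
            self.on_failure(block_id, retry_error)

def failing_submit(block_id):
    raise RuntimeError("unable to create new native thread")

failures = []
t = RetryingTransferor(
    submit_retry=failing_submit,
    on_failure=lambda block, err: failures.append((block, str(err))),
)
t.handle_transfer_failure("shuffle_0_1", IOError("Cannot allocate memory"))
print(failures)   # [('shuffle_0_1', 'unable to create new native thread')]
```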
[jira] [Commented] (SPARK-43606) Enable IndexesTests.test_index_basic for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17749848#comment-17749848 ]

GridGain Integration commented on SPARK-43606:
----------------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/42267

> Enable IndexesTests.test_index_basic for pandas 2.0.0.
> ------------------------------------------------------
>
>                 Key: SPARK-43606
>                 URL: https://issues.apache.org/jira/browse/SPARK-43606
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Pandas API on Spark
>    Affects Versions: 3.5.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> Enable IndexesTests.test_index_basic for pandas 2.0.0.
[jira] [Commented] (SPARK-44098) Introduce python breaking change detection
[ https://issues.apache.org/jira/browse/SPARK-44098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747396#comment-17747396 ]

GridGain Integration commented on SPARK-44098:
----------------------------------------------

User 'StardustDL' has created a pull request for this issue:
https://github.com/apache/spark/pull/42125

> Introduce python breaking change detection
> ------------------------------------------
>
>                 Key: SPARK-44098
>                 URL: https://issues.apache.org/jira/browse/SPARK-44098
>             Project: Spark
>          Issue Type: Test
>          Components: Project Infra, Tests
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
> We have breaking change detection for Binary Compatibility and Protobufs,
> but we don't have one for Python.
> The authors of [aexpy|https://github.com/StardustDL/aexpy] are willing to
> help PySpark detect Python breaking changes.
[jira] [Commented] (SPARK-44509) Fine grained interrupt in Python Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746270#comment-17746270 ]

GridGain Integration commented on SPARK-44509:
----------------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42120

> Fine grained interrupt in Python Spark Connect
> ----------------------------------------------
>
>                 Key: SPARK-44509
>                 URL: https://issues.apache.org/jira/browse/SPARK-44509
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect, PySpark
>    Affects Versions: 3.5.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> Same as https://issues.apache.org/jira/browse/SPARK-44422 but we need it for
> Python.
[jira] [Commented] (SPARK-44505) DataSource v2 Scans should not require planning the input partitions on explain
[ https://issues.apache.org/jira/browse/SPARK-44505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17745598#comment-17745598 ]

GridGain Integration commented on SPARK-44505:
----------------------------------------------

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/42099

> DataSource v2 Scans should not require planning the input partitions on explain
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-44505
>                 URL: https://issues.apache.org/jira/browse/SPARK-44505
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Priority: Major
>
> Right now, we always call `planInputPartitions()` for a DSv2 implementation,
> even when no Spark job is run and the query is only explained.
> We should provide a way to avoid scanning all input partitions just to
> determine whether the input is columnar or not. The scan should provide an
> override.
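The contract proposed above can be sketched as: let the scan report columnar support via a cheap override, and plan input partitions lazily so that `explain` never triggers the expensive call. This is a hypothetical pure-Python model, not the actual DSv2 interface:

```python
# Sketch: partitions are planned lazily, only on the execution path;
# the explain path asks a cheap question instead.

class Scan:
    def __init__(self, plan_partitions):
        self._plan_partitions = plan_partitions   # the expensive call
        self._partitions = None

    def supports_columnar(self):
        # Cheap override -- the scan knows its format without planning.
        return False

    def partitions(self):
        if self._partitions is None:              # plan at most once
            self._partitions = self._plan_partitions()
        return self._partitions

planned = []
def expensive_planning():
    planned.append("planned")
    return ["p0", "p1"]

scan = Scan(expensive_planning)
scan.supports_columnar()   # explain path: no planning happens
print(planned)             # []
scan.partitions()          # execution path triggers planning
print(planned)             # ['planned']
```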
[jira] [Commented] (SPARK-43611) Fix unexpected `AnalysisException` from Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-43611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743734#comment-17743734 ]

GridGain Integration commented on SPARK-43611:
----------------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42040

> Fix unexpected `AnalysisException` from Spark Connect client
> ------------------------------------------------------------
>
>                 Key: SPARK-43611
>                 URL: https://issues.apache.org/jira/browse/SPARK-43611
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, Pandas API on Spark
>    Affects Versions: 3.5.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> Reproducible example:
> {code:java}
> >>> import pyspark.pandas as ps
> >>> psdf1 = ps.DataFrame({"A": [1, 2, 3]})
> >>> psdf2 = ps.DataFrame({"B": [1, 2, 3]})
> >>> psdf1.append(psdf2)
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py:8897: FutureWarning: The DataFrame.append method is deprecated and will be removed in a future version. Use pyspark.pandas.concat instead.
>   warnings.warn(
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/frame.py", line 8930, in append
>     return cast(DataFrame, concat([self, other], ignore_index=ignore_index))
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/namespace.py", line 2703, in concat
>     psdfs[0]._internal.copy(
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py", line 1508, in copy
>     return InternalFrame(
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/pandas/internal.py", line 753, in __init__
>     schema = spark_frame.select(data_spark_columns).schema
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/dataframe.py", line 1650, in schema
>     return self._session.client.schema(query)
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", line 777, in schema
>     schema = self._analyze(method="schema", plan=plan).schema
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", line 958, in _analyze
>     self._handle_error(error)
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", line 1195, in _handle_error
>     self._handle_rpc_error(error)
>   File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/client.py", line 1231, in _handle_rpc_error
>     raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.AnalysisException: When resolving 'A, fail to find subplan with plan_id=16 in 'Project ['A, 'B]
> +- Project [__index_level_0__#1101L, A#1102L, B#1157L, monotonically_increasing_id() AS __natural_order__#1163L]
>    +- Union false, false
>       :- Project [__index_level_0__#1101L, A#1102L, cast(B#1116 as bigint) AS B#1157L]
>       :  +- Project [__index_level_0__#1101L, A#1102L, B#1116]
>       :     +- Project [__index_level_0__#1101L, A#1102L, __natural_order__#1108L, null AS B#1116]
>       :        +- Project [__index_level_0__#1101L, A#1102L, __natural_order__#1108L]
>       :           +- Project [__index_level_0__#1101L, A#1102L, monotonically_increasing_id() AS __natural_order__#1108L]
>       :              +- Project [__index_level_0__#1097L AS __index_level_0__#1101L, A#1098L AS A#1102L]
>       :                 +- LocalRelation [__index_level_0__#1097L, A#1098L]
>       +- Project [__index_level_0__#1137L, cast(A#1152 as bigint) AS A#1158L, B#1138L]
>          +- Project [__index_level_0__#1137L, A#1152, B#1138L]
>             +- Project [__index_level_0__#1137L, B#1138L, __natural_order__#1144L, null AS A#1152]
>                +- Project [__index_level_0__#1137L, B#1138L, __natural_order__#1144L]
>                   +- Project [__index_level_0__#1137L, B#1138L, monotonically_increasing_id() AS __natural_order__#1144L]
>                      +- Project [__index_level_0__#1133L AS __index_level_0__#1137L, B#1134L AS B#1138L]
>                         +- LocalRelation [__index_level_0__#1133L, B#1134L]
> {code}
> Another example:
> {code:java}
> >>> pdf = pd.DataFrame(
> ...     {
> ...         "A": [None, 3, None, None],
> ...         "B": [2, 4, None, 3],
> ...         "C": [None, None, None, 1],
> ...         "D": [0, 1, 5, 4],
> ...     },
> ...     columns=["A", "B", "C", "D"],
> ... )
> >>> psdf = ps.from_pandas(pdf)
> >>> psdf.backfill()
> /Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/expressions.py:945: UserWarning: WARN WindowExpression: No Partition Defined for Window operation! Moving all data to a single partition, this
[jira] [Commented] (SPARK-44406) DataFrame depending on temp view fail after the view is dropped
[ https://issues.apache.org/jira/browse/SPARK-44406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742827#comment-17742827 ]

GridGain Integration commented on SPARK-44406:
----------------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/41986

> DataFrame depending on temp view fail after the view is dropped
> ---------------------------------------------------------------
>
>                 Key: SPARK-44406
>                 URL: https://issues.apache.org/jira/browse/SPARK-44406
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
> In vanilla Spark:
> {code:java}
> In [1]: df = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
> In [2]: df.createOrReplaceTempView("t")
> In [3]: df2 = spark.sql("select * from t")
> In [4]: df2.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  1|  4|
> |  2|  4|
> |  3|  6|
> +---+---+
> In [5]: spark.catalog.dropTempView("t")
> Out[5]: True
> In [6]: df2.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  1|  4|
> |  2|  4|
> |  3|  6|
> +---+---+
> {code}
> In Spark Connect:
> {code:java}
> In [1]: df = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
> In [2]: df.createOrReplaceTempView("t")
> In [3]: df2 = spark.sql("select * from t")
> In [4]: df2.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  1|  4|
> |  2|  4|
> |  3|  6|
> +---+---+
> In [5]: spark.catalog.dropTempView("t")
> Out[5]: True
> In [6]: df2.show()
> 23/07/13 11:57:18 ERROR SparkConnectService: Error during: execute. UserId: ruifeng.zheng. SessionId: 1fc234fd-07da-4ad0-9ec5-2d818cef6033.
> org.apache.spark.sql.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `t` cannot be found. Verify the spelling and correctness of the schema and catalog.
> If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
> To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 14;
> 'Project [*]
> +- 'UnresolvedRelation [t], [], false
> {code}
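The behavioural difference above comes down to when the view reference is resolved: classic Spark analyzes the plan when `spark.sql(...)` builds the DataFrame, while Spark Connect keeps an unresolved plan on the client and re-resolves it on the server per execution. A pure-Python model of that distinction (the registry and helpers are illustrative, not Spark internals):

```python
# Model: eager resolution snapshots the view at DataFrame-creation time,
# lazy resolution looks the view up again on every show() -- so dropping
# the view later only breaks the lazily-resolved (Connect-style) plan.

views = {"t": [(1, 4), (2, 4), (3, 6)]}

def resolve_eagerly(name):
    data = views[name]              # resolved once, at creation time
    return lambda: data

def resolve_lazily(name):
    return lambda: views[name]      # re-resolved on every execution

eager = resolve_eagerly("t")
lazy = resolve_lazily("t")
del views["t"]                      # dropTempView("t")

print(eager())                      # still works, like vanilla Spark
try:
    lazy()
except KeyError:
    print("TABLE_OR_VIEW_NOT_FOUND")  # the Connect behaviour
```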
[jira] [Commented] (SPARK-43974) Upgrade buf to v1.23.1
[ https://issues.apache.org/jira/browse/SPARK-43974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742352#comment-17742352 ]

GridGain Integration commented on SPARK-43974:
----------------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/41937

> Upgrade buf to v1.23.1
> ----------------------
>
>                 Key: SPARK-43974
>                 URL: https://issues.apache.org/jira/browse/SPARK-43974
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: BingKun Pan
>            Assignee: BingKun Pan
>            Priority: Minor
>             Fix For: 3.5.0
>
[jira] [Commented] (SPARK-44263) Allow ChannelBuilder extensions -- Scala
[ https://issues.apache.org/jira/browse/SPARK-44263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740973#comment-17740973 ]

GridGain Integration commented on SPARK-44263:
----------------------------------------------

User 'cdkrot' has created a pull request for this issue:
https://github.com/apache/spark/pull/41880

> Allow ChannelBuilder extensions -- Scala
> ----------------------------------------
>
>                 Key: SPARK-44263
>                 URL: https://issues.apache.org/jira/browse/SPARK-44263
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.4.1
>            Reporter: Alice Sayutina
>            Priority: Major
>
> Follow up to https://issues.apache.org/jira/browse/SPARK-43332
> Provide similar extension capabilities in Scala.
[jira] [Commented] (SPARK-43801) Support unwrap date type to string type in UnwrapCastInBinaryComparison
[ https://issues.apache.org/jira/browse/SPARK-43801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17734095#comment-17734095 ] GridGain Integration commented on SPARK-43801: -- User 'puchengy' has created a pull request for this issue: https://github.com/apache/spark/pull/41332 > Support unwrap date type to string type in UnwrapCastInBinaryComparison > --- > > Key: SPARK-43801 > URL: https://issues.apache.org/jira/browse/SPARK-43801 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Pucheng Yang >Priority: Major > > Similar to https://issues.apache.org/jira/browse/SPARK-42597 and others, add > support to > UnwrapCastInBinaryComparison such that it can unwrap date type to string type.
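The safety of this kind of unwrap hinges on a simple property: ISO-8601 date strings ('YYYY-MM-DD') compare lexicographically in the same order as the dates they denote. A minimal sketch of that property (illustrating the idea behind the rule, not Spark's actual UnwrapCastInBinaryComparison code):

```python
from datetime import date, timedelta

# ISO-8601 strings order lexicographically the same way dates order
# chronologically, which is what makes rewriting a predicate like
#   cast(date_col AS STRING) > '2023-06-01'
# into date_col > DATE'2023-06-01' safe for well-formed date literals.
start = date(2023, 1, 1)
for offset in range(0, 365, 7):
    a = start + timedelta(days=offset)
    b = start + timedelta(days=offset + 1)
    assert (a.isoformat() < b.isoformat()) == (a < b)
print("lexicographic order matches chronological order")
```

Note the caveat: the string side must actually be a well-formed date literal; the real rule also has to handle string literals that do not parse as dates.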
[jira] [Commented] (SPARK-44065) Optimize BroadcastHashJoin skew when localShuffleReader is disabled
[ https://issues.apache.org/jira/browse/SPARK-44065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733300#comment-17733300 ] GridGain Integration commented on SPARK-44065: -- User 'wForget' has created a pull request for this issue: https://github.com/apache/spark/pull/41609 > Optimize BroadcastHashJoin skew when localShuffleReader is disabled > --- > > Key: SPARK-44065 > URL: https://issues.apache.org/jira/browse/SPARK-44065 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Zhen Wang >Priority: Major > > In RemoteShuffleService services such as uniffle and celeborn, it is > recommended to disable localShuffleReader by default for better performance. > But it may make BroadcastHashJoin skewed, so I want to optimize > BroadcastHashJoin skew in OptimizeSkewedJoin when localShuffleReader is > disabled. > > Refer to: > https://github.com/apache/incubator-celeborn#spark-configuration > https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md#support-spark-aqe
[jira] [Commented] (SPARK-43511) Implemented State APIs for Spark Connect Scala
[ https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733195#comment-17733195 ] GridGain Integration commented on SPARK-43511: -- User 'bogao007' has created a pull request for this issue: https://github.com/apache/spark/pull/41558 > Implemented State APIs for Spark Connect Scala > -- > > Key: SPARK-43511 > URL: https://issues.apache.org/jira/browse/SPARK-43511 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Bo Gao >Priority: Major > > Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark > Connect Scala
[jira] [Commented] (SPARK-44057) Mark all `local-cluster` tests as `ExtendedSQLTest`
[ https://issues.apache.org/jira/browse/SPARK-44057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732751#comment-17732751 ] GridGain Integration commented on SPARK-44057: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/41601 > Mark all `local-cluster` tests as `ExtendedSQLTest` > --- > > Key: SPARK-44057 > URL: https://issues.apache.org/jira/browse/SPARK-44057 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.5.0 > > > This issue aims to mark all `local-cluster` tests as `ExtendedSQLTest` > https://pipelines.actions.githubusercontent.com/serviceHosts/03398d36-4378-4d47-a936-fba0a5e8ccb9/_apis/pipelines/1/runs/251144/signedlogcontent/12?urlExpires=2023-06-14T17%3A11%3A50.2399742Z=HMACV1=%2FHTlrgaHtF2Jv65vw%2Fj4SzT69etebI0swSSM6dXC0tk%3D > {code} > $ git grep local-cluster sql/core/ > sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala: > val session = SparkSession.builder().master("local-cluster[3, 1, > 1024]").getOrCreate() > sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala: > val session = SparkSession.builder().master("local-cluster[3, 1, > 1024]").getOrCreate() > sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala:// > Additional tests run in 'local-cluster' mode. > sql/core/src/test/scala/org/apache/spark/sql/execution/BroadcastExchangeSuite.scala: > .setMaster("local-cluster[2,1,1024]") > sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala: > "--master", "local-cluster[1,1,1024]", > sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala: >* Create a new [[SparkSession]] running in local-cluster mode with unsafe > and codegen enabled. 
> sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala: > .master("local-cluster[2,1,1024]") > sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala: > * Tests in this suite we need to run Spark in local-cluster mode. In > particular, the use of > sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala: >* Create a new [[SparkSession]] running in local-cluster mode with unsafe > and codegen enabled. > sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala: > .master("local-cluster[2,1,512]") > sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala: > .config(sparkConf.setMaster("local-cluster[2, 1, 1024]")) > sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala: > // Create a new [[SparkSession]] running in local-cluster mode. > sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala: > .master("local-cluster[2,1,1024]") > {code}
[jira] [Commented] (SPARK-43943) Add math functions to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728738#comment-17728738 ] GridGain Integration commented on SPARK-43943: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/41435 > Add math functions to Scala and Python > -- > > Key: SPARK-43943 > URL: https://issues.apache.org/jira/browse/SPARK-43943 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add the following functions: > * ceiling > * e > * pi > * ln > * negative > * positive > * power > * sign > * std > * width_bucket > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client
[jira] [Commented] (SPARK-43075) Change gRPC to grpcio when it is not installed.
[ https://issues.apache.org/jira/browse/SPARK-43075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728681#comment-17728681 ] GridGain Integration commented on SPARK-43075: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/40716 > Change gRPC to grpcio when it is not installed. > --- > > Key: SPARK-43075 > URL: https://issues.apache.org/jira/browse/SPARK-43075 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 >
[jira] [Commented] (SPARK-43063) `df.show` handle null should print NULL instead of null
[ https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728682#comment-17728682 ] GridGain Integration commented on SPARK-43063: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/41432 > `df.show` handle null should print NULL instead of null > --- > > Key: SPARK-43063 > URL: https://issues.apache.org/jira/browse/SPARK-43063 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Assignee: yikaifei >Priority: Trivial > Fix For: 3.5.0 > > > `df.show` handle null should print NULL instead of null to consistent > behavior; > {code:java} > Like as the following behavior is currently inconsistent: > ``` shell > scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, > 'New Jersey', 4, 'Seattle') as result").show(false) > +--+ > |result| > +--+ > |null | > +--+ > ``` > ``` shell > spark-sql> DESC FUNCTION EXTENDED decode; > function_desc > Function: decode > Class: org.apache.spark.sql.catalyst.expressions.Decode > Usage: > decode(bin, charset) - Decodes the first argument using the second > argument character set. > decode(expr, search, result [, search, result ] ... [, default]) - > Compares expr > to each search value in order. If expr is equal to a search value, > decode returns > the corresponding result. If no match is found, then it returns > default. If default > is omitted, it returns null. 
> Extended Usage: > Examples: > > SELECT decode(encode('abc', 'utf-8'), 'utf-8'); >abc > > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >San Francisco > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >Non domestic > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle'); >NULL > Since: 3.2.0 > Time taken: 0.074 seconds, Fetched 4 row(s) > ``` > ``` shell > spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New > Jersey', 4, 'Seattle'); > NULL > {code}
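The inconsistency quoted above is a rendering concern: `df.show` prints a SQL NULL as lowercase `null`, while the spark-sql CLI prints `NULL` for the same value. A hedged sketch of a consistent cell formatter (the function is hypothetical, not Spark's actual implementation):

```python
def format_cell(value) -> str:
    # Render a SQL NULL as the uppercase keyword, matching the spark-sql
    # CLI output quoted above; everything else is stringified as usual.
    return "NULL" if value is None else str(value)

print(format_cell(None))             # NULL
print(format_cell("San Francisco"))  # San Francisco
```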
[jira] [Commented] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
[ https://issues.apache.org/jira/browse/SPARK-43205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727621#comment-17727621 ] GridGain Integration commented on SPARK-43205: -- User 'srielau' has created a pull request for this issue: https://github.com/apache/spark/pull/40884 > Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier > --- > > Key: SPARK-43205 > URL: https://issues.apache.org/jira/browse/SPARK-43205 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.5.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Fix For: 3.5.0 > > > There is a requirement for SQL templates, where the table and/or column names > are provided through substitution. This can be done today using variable > substitution: > SET hivevar:tabname = mytab; > SELECT * FROM ${ hivevar:tabname }; > A straight variable substitution is dangerous since it does allow for SQL > injection: > SET hivevar:tabname = mytab, someothertab; > SELECT * FROM ${ hivevar:tabname }; > A way to get around this problem is to wrap the variable substitution with a > clause that limits the scope to produce an identifier. > This approach is taken by Snowflake: > > [https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql] > SET hivevar:tabname = 'tabname'; > SELECT * FROM IDENTIFIER(${ hivevar:tabname })
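The injection scenario above can be illustrated outside Spark: raw substitution splices arbitrary SQL into the statement, while an identifier-checking wrapper rejects it. A simplified Python sketch (the validation rule here is hypothetical and far narrower than Spark's real IDENTIFIER clause, which also handles qualified and quoted names):

```python
import re

def identifier(name: str) -> str:
    # Accept only a simple unqualified identifier; anything else, such as
    # "mytab, someothertab", would smuggle extra SQL into the query text.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"not a valid identifier: {name!r}")
    return name

print(f"SELECT * FROM {identifier('mytab')}")  # SELECT * FROM mytab
# identifier("mytab, someothertab") raises ValueError: the injection is blocked
```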
[jira] [Commented] (SPARK-43171) Support dynamic changing unix user in Pod
[ https://issues.apache.org/jira/browse/SPARK-43171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725489#comment-17725489 ] GridGain Integration commented on SPARK-43171: -- User 'pan3793' has created a pull request for this issue: https://github.com/apache/spark/pull/40831 > Support dynamic changing unix user in Pod > - > > Key: SPARK-43171 > URL: https://issues.apache.org/jira/browse/SPARK-43171 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major >
[jira] [Commented] (SPARK-40708) Auto update table statistics based on write metrics
[ https://issues.apache.org/jira/browse/SPARK-40708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725491#comment-17725491 ] GridGain Integration commented on SPARK-40708: -- User 'jackylee-ch' has created a pull request for this issue: https://github.com/apache/spark/pull/40944 > Auto update table statistics based on write metrics > --- > > Key: SPARK-40708 > URL: https://issues.apache.org/jira/browse/SPARK-40708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > // Get write statistics > def getWriteStats(mode: SaveMode, metrics: Map[String, SQLMetric]): > Option[WriteStats] = { > val numBytes = > metrics.get(NUM_OUTPUT_BYTES_KEY).map(_.value).map(BigInt(_)) > val numRows = metrics.get(NUM_OUTPUT_ROWS_KEY).map(_.value).map(BigInt(_)) > numBytes.map(WriteStats(mode, _, numRows)) > } > // Update table statistics > val stat = wroteStats.get > stat.mode match { > case SaveMode.Overwrite | SaveMode.ErrorIfExists => > catalog.alterTableStats(table.identifier, > Some(CatalogStatistics(stat.numBytes, stat.numRows))) > case _ if table.stats.nonEmpty => // SaveMode.Append > catalog.alterTableStats(table.identifier, None) > case _ => // SaveMode.Ignore Do nothing > {code}
[jira] [Commented] (SPARK-43264) Avoid allocation of unwritten ColumnVector in VectorizedReader
[ https://issues.apache.org/jira/browse/SPARK-43264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725490#comment-17725490 ] GridGain Integration commented on SPARK-43264: -- User 'majdyz' has created a pull request for this issue: https://github.com/apache/spark/pull/40929 > Avoid allocation of unwritten ColumnVector in VectorizedReader > -- > > Key: SPARK-43264 > URL: https://issues.apache.org/jira/browse/SPARK-43264 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Zamil Majdy >Priority: Major > > The Spark Vectorized Reader allocates the array for every field for each value > count, even when the array ends up empty. This causes high memory consumption > when reading a table with a large struct+array or many columns with sparse > values. One way to fix this is to allocate the column vector lazily, so the > array is allocated only when it is needed (i.e., when the array is written).
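The fix described above, allocating only on first write, can be sketched in a few lines (a hypothetical class illustrating the idea, not Spark's actual ColumnVector):

```python
class LazyColumnVector:
    # Sketch of the lazy-allocation idea: defer the backing array until
    # the first write, so vectors for fields that are never written cost
    # no memory (names are hypothetical, not Spark's implementation).
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = None

    def put(self, i: int, value):
        if self._data is None:          # allocate lazily, on first write
            self._data = [None] * self.capacity
        self._data[i] = value

    @property
    def allocated(self) -> bool:
        return self._data is not None

v = LazyColumnVector(4096)
print(v.allocated)   # False: no array allocated yet
v.put(0, 42)
print(v.allocated)   # True: allocated on first write
```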
[jira] [Commented] (SPARK-43024) Upgrade pandas to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-43024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724276#comment-17724276 ] GridGain Integration commented on SPARK-43024: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/41211 > Upgrade pandas to 2.0.0 > --- > > Key: SPARK-43024 > URL: https://issues.apache.org/jira/browse/SPARK-43024 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > pandas 2.0.0 was released on Apr 03, 2023. > > We should update our infra and docs to support it.
[jira] [Commented] (SPARK-43537) Upgrade the asm deps in the tools module to 9.4
[ https://issues.apache.org/jira/browse/SPARK-43537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723478#comment-17723478 ] GridGain Integration commented on SPARK-43537: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/41198 > Upgrade the asm deps in the tools module to 9.4 > --- > > Key: SPARK-43537 > URL: https://issues.apache.org/jira/browse/SPARK-43537 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major >
[jira] [Commented] (SPARK-43206) Connect Better StreamingQueryException
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718343#comment-17718343 ] GridGain Integration commented on SPARK-43206: -- User 'WweiL' has created a pull request for this issue: https://github.com/apache/spark/pull/40966 > Connect Better StreamingQueryException > -- > > Key: SPARK-43206 > URL: https://issues.apache.org/jira/browse/SPARK-43206 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > [https://github.com/apache/spark/pull/40785#issuecomment-1515522281] >
[jira] [Commented] (SPARK-43263) Upgrade FasterXML jackson to 2.15.0
[ https://issues.apache.org/jira/browse/SPARK-43263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17716634#comment-17716634 ] GridGain Integration commented on SPARK-43263: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/40933 > Upgrade FasterXML jackson to 2.15.0 > --- > > Key: SPARK-43263 > URL: https://issues.apache.org/jira/browse/SPARK-43263 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > * #390: (yaml) Upgrade to Snakeyaml 2.0 (resolves > [CVE-2022-1471|https://nvd.nist.gov/vuln/detail/CVE-2022-1471]) > (contributed by @pjfannin
[jira] [Commented] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-43197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714647#comment-17714647 ] GridGain Integration commented on SPARK-43197: -- User 'pan3793' has created a pull request for this issue: https://github.com/apache/spark/pull/40860 > Clean up the code written for compatibility with Hadoop 2 > - > > Key: SPARK-43197 > URL: https://issues.apache.org/jira/browse/SPARK-43197 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, SQL, YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > SPARK-42452 removed support for Hadoop 2, so we can clean up the code written for > compatibility with Hadoop 2 to make it more concise
[jira] [Commented] (SPARK-43215) Remove `ResourceRequestHelper#isYarnResourceTypesAvailable`
[ https://issues.apache.org/jira/browse/SPARK-43215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714646#comment-17714646 ] GridGain Integration commented on SPARK-43215: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40876 > Remove `ResourceRequestHelper#isYarnResourceTypesAvailable` > --- > > Key: SPARK-43215 > URL: https://issues.apache.org/jira/browse/SPARK-43215 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor >
[jira] [Commented] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts
[ https://issues.apache.org/jira/browse/SPARK-42657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713583#comment-17713583 ] GridGain Integration commented on SPARK-42657: -- User 'vicennial' has created a pull request for this issue: https://github.com/apache/spark/pull/40675 > Support to find and transfer client-side REPL classfiles to server as > artifacts > - > > Key: SPARK-42657 > URL: https://issues.apache.org/jira/browse/SPARK-42657 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.5.0 > > > To run UDFs which are defined in the client-side REPL, we require a mechanism > that can find the local REPL classfiles and then utilise the mechanism from > https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the > server as artifacts.
[jira] [Commented] (SPARK-43146) Implement eager evaluation.
[ https://issues.apache.org/jira/browse/SPARK-43146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713265#comment-17713265 ] GridGain Integration commented on SPARK-43146: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40800 > Implement eager evaluation. > --- > > Key: SPARK-43146 > URL: https://issues.apache.org/jira/browse/SPARK-43146 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major >
[jira] [Commented] (SPARK-43099) `Class.getCanonicalName` return null for anonymous class on JDK15+, impacting function registry
[ https://issues.apache.org/jira/browse/SPARK-43099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712855#comment-17712855 ] GridGain Integration commented on SPARK-43099: -- User 'alexjinghn' has created a pull request for this issue: https://github.com/apache/spark/pull/40747 > `Class.getCanonicalName` return null for anonymous class on JDK15+, impacting > function registry > --- > > Key: SPARK-43099 > URL: https://issues.apache.org/jira/browse/SPARK-43099 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Alex Jing >Priority: Major > > On JDK15+, lambda and method references are implemented using hidden classes > (https://openjdk.org/jeps/371). According to the JEP, > {quote}{{Class::getCanonicalName}} returns {{null}}, indicating the > hidden class has no canonical name. (Note that the {{Class}} object for an > anonymous class in the Java language has the same behavior.) > {quote} > This means > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L53 > will always be null. > > This can be fixed by replacing `getCanonicalName` with `getName`
[jira] [Commented] (SPARK-43105) Abbreviate Bytes in proto message's debug string
[ https://issues.apache.org/jira/browse/SPARK-43105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711311#comment-17711311 ] GridGain Integration commented on SPARK-43105: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40750 > Abbreviate Bytes in proto message's debug string > > > Key: SPARK-43105 > URL: https://issues.apache.org/jira/browse/SPARK-43105 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major >
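The improvement can be sketched as a helper that truncates long byte fields in a debug string to a short hex prefix plus the total length (the function name and cutoff are hypothetical; the real change lives in Spark Connect's proto utilities):

```python
def abbreviate_bytes(b: bytes, max_len: int = 8) -> str:
    # Debug strings show only a hex prefix of long byte fields plus the
    # total length, instead of dumping the full payload (illustrative
    # sketch; name and cutoff are hypothetical).
    if len(b) <= max_len:
        return b.hex()
    return f"{b[:max_len].hex()}... ({len(b)} bytes)"

print(abbreviate_bytes(b"\x01\x02"))  # 0102
print(abbreviate_bytes(bytes(100)))   # 0000000000000000... (100 bytes)
```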
[jira] [Commented] (SPARK-43063) `df.show` handle null should print NULL instead of null
[ https://issues.apache.org/jira/browse/SPARK-43063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710133#comment-17710133 ] GridGain Integration commented on SPARK-43063: -- User 'Yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/40699 > `df.show` handle null should print NULL instead of null > --- > > Key: SPARK-43063 > URL: https://issues.apache.org/jira/browse/SPARK-43063 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Priority: Trivial > > `df.show` handle null should print NULL instead of null to consistent > behavior; > {code:java} > Like as the following behavior is currently inconsistent: > ``` shell > scala> spark.sql("select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, > 'New Jersey', 4, 'Seattle') as result").show(false) > +--+ > |result| > +--+ > |null | > +--+ > ``` > ``` shell > spark-sql> DESC FUNCTION EXTENDED decode; > function_desc > Function: decode > Class: org.apache.spark.sql.catalyst.expressions.Decode > Usage: > decode(bin, charset) - Decodes the first argument using the second > argument character set. > decode(expr, search, result [, search, result ] ... [, default]) - > Compares expr > to each search value in order. If expr is equal to a search value, > decode returns > the corresponding result. If no match is found, then it returns > default. If default > is omitted, it returns null. 
> Extended Usage: > Examples: > > SELECT decode(encode('abc', 'utf-8'), 'utf-8'); >abc > > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >San Francisco > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle', 'Non domestic'); >Non domestic > > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', > 4, 'Seattle'); >NULL > Since: 3.2.0 > Time taken: 0.074 seconds, Fetched 4 row(s) > ``` > ``` shell > spark-sql> select decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New > Jersey', 4, 'Seattle'); > NULL > {code}
[jira] [Commented] (SPARK-43076) Removing the dependency on `grpcio` when remote session is not used.
[ https://issues.apache.org/jira/browse/SPARK-43076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710083#comment-17710083 ] GridGain Integration commented on SPARK-43076: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40722 > Removing the dependency on `grpcio` when remote session is not used. > > > Key: SPARK-43076 > URL: https://issues.apache.org/jira/browse/SPARK-43076 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > We should not require installing `grpcio` when a remote session is not used by > the pandas API on Spark.