[jira] [Updated] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung
[ https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mars updated SPARK-40320:
-
Description:

*Reproduce step:*
Set `spark.plugins=ErrorSparkPlugin`, with the `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes as below (the code is abbreviated for clarity):
{code:java}
class ErrorSparkPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()

  override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
}
{code}
{code:java}
class ErrorExecutorPlugin extends ExecutorPlugin {
  private val checkingInterval: Long = 1

  override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
    if (checkingInterval == 1) {
      throw new UnsatisfiedLinkError("My Exception error")
    }
  }
}
{code}
The Executor shows as active when we check in the Spark UI; however, it is broken and never receives any tasks.

*Root Cause:*
I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` rethrows fatal errors (`UnsatisfiedLinkError` is a fatal error here) in its method `dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process stays alive, but the communication thread is no longer working (see `MessageLoop#receiveLoopRunnable`: `receiveLoop()` has exited, so the executor doesn't receive any messages).

Some ideas: it is very hard to know what happened here unless we read the code. The Executor is active but can't do anything, so we are left wondering whether the driver is broken or the Executor has a problem.
I think at least the Executor status shouldn't be active here, or the Executor could call exitExecutor (kill itself).

> When the Executor plugin fails to initialize, the Executor shows active but
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 3.0.0
> Reporter: Mars
> Priority: Major
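The failure mode reported above can be sketched outside of Spark: a dispatcher thread dies on an unhandled error while the surrounding process stays alive and looks healthy. The Python below is an illustrative stand-in for `MessageLoop#receiveLoopRunnable`, not Spark's actual RPC code; all names are invented for the sketch.

```python
import queue
import threading
import time

inbox = queue.Queue()
processed = []

def receive_loop():
    # Stand-in for MessageLoop's receiveLoop(): one unhandled error
    # and this thread is gone for good.
    while True:
        msg = inbox.get()
        if msg == "init-plugin":
            # Stand-in for a plugin's init() throwing a fatal error
            # such as UnsatisfiedLinkError.
            raise RuntimeError("plugin failed to initialize")
        processed.append(msg)

loop = threading.Thread(target=receive_loop, daemon=True)
loop.start()

inbox.put("init-plugin")   # kills the receive loop
time.sleep(0.2)
inbox.put("launch-task")   # nobody is listening any more
time.sleep(0.2)

process_alive = threading.main_thread().is_alive()  # the "JVM" is fine
loop_dead = not loop.is_alive()                     # but the loop thread is dead
unprocessed = inbox.qsize()                         # messages pile up unhandled
```

This is why the executor looks hung: process-level liveness checks pass while the only thread that could accept tasks is gone.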
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600203#comment-17600203 ] Ruifeng Zheng commented on SPARK-40327: --- cc [~yikunkero] If you want to have a try, feel free to take over some of those subtasks, thanks in advance! > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40298) shuffle data recovery on the reused PVCs no effect
[ https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] todd reopened SPARK-40298:
--
> shuffle data recovery on the reused PVCs no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.2
> Reporter: todd
> Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
> I used Spark 3.2.2 to test the [Support shuffle data recovery on the reused PVCs (SPARK-35593)] feature. I found that when a shuffle read fails, the data is still read from the source.
> It can be confirmed that the PVC has been reused by other pods, and the index and data information has been written.
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G
> --conf spark.executor.memory=15G
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40349) Implement `RollingGroupby.sem`.
Haejoon Lee created SPARK-40349: --- Summary: Implement `RollingGroupby.sem`. Key: SPARK-40349 URL: https://issues.apache.org/jira/browse/SPARK-40349 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `RollingGroupby.sem` for increasing pandas API coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
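For reference, `sem` is the standard error of the mean: the sample standard deviation (ddof=1, the pandas default) divided by the square root of the number of observations in the window. A minimal pure-Python sketch of the rolling version these subtasks target (the function name and shape are illustrative, not the pandas-on-Spark API):

```python
import math
import statistics

def rolling_sem(values, window):
    # Standard error of the mean over each complete rolling window,
    # using the sample standard deviation (ddof=1) like pandas does.
    out = []
    for i in range(window - 1, len(values)):
        w = values[i - window + 1 : i + 1]
        out.append(statistics.stdev(w) / math.sqrt(len(w)))
    return out

result = rolling_sem([1.0, 2.0, 4.0, 7.0], window=2)  # one value per full window
```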
[jira] [Created] (SPARK-40348) Implement `RollingGroupby.quantile`.
Haejoon Lee created SPARK-40348: --- Summary: Implement `RollingGroupby.quantile`. Key: SPARK-40348 URL: https://issues.apache.org/jira/browse/SPARK-40348 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `RollingGroupby.quantile` for increasing pandas API coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40347) Implement `RollingGroupby.median`.
Haejoon Lee created SPARK-40347: --- Summary: Implement `RollingGroupby.median`. Key: SPARK-40347 URL: https://issues.apache.org/jira/browse/SPARK-40347 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `RollingGroupby.median` for increasing pandas API coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40346) Implement `ExpandingGroupby.sem`.
Haejoon Lee created SPARK-40346: --- Summary: Implement `ExpandingGroupby.sem`. Key: SPARK-40346 URL: https://issues.apache.org/jira/browse/SPARK-40346 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `ExpandingGroupby.sem` for increasing pandas API coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40345) Implement `ExpandingGroupby.quantile`.
Haejoon Lee created SPARK-40345: --- Summary: Implement `ExpandingGroupby.quantile`. Key: SPARK-40345 URL: https://issues.apache.org/jira/browse/SPARK-40345 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `ExpandingGroupby.quantile` for increasing pandas API coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40344) Implement `ExpandingGroupby.median`.
Haejoon Lee created SPARK-40344: --- Summary: Implement `ExpandingGroupby.median`. Key: SPARK-40344 URL: https://issues.apache.org/jira/browse/SPARK-40344 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `ExpandingGroupby.median` for increasing pandas API coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40331) Java 11 should be used as the recommended running environment
[ https://issues.apache.org/jira/browse/SPARK-40331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40331: - Description: Similar cases described in SPARK-40303 will not have negative effects if Java 11+ is used as runtime was: Similar cases described in SPARK-40303 will not have negative effects if Java 11+ is used as runtime > Java 11 should be used as the recommended running environment > - > > Key: SPARK-40331 > URL: https://issues.apache.org/jira/browse/SPARK-40331 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > Similar cases described in SPARK-40303 will not have negative effects if > Java 11+ is used as runtime > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40343) Implement `Rolling.sem`.
Haejoon Lee created SPARK-40343: --- Summary: Implement `Rolling.sem`. Key: SPARK-40343 URL: https://issues.apache.org/jira/browse/SPARK-40343 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `Rolling.sem` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.sem.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40342) Implement `Rolling.quantile`.
Haejoon Lee created SPARK-40342: --- Summary: Implement `Rolling.quantile`. Key: SPARK-40342 URL: https://issues.apache.org/jira/browse/SPARK-40342 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `Rolling.quantile` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
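pandas computes `Rolling.quantile` with linear interpolation by default. A small pure-Python sketch of those semantics (names are illustrative, not the real API):

```python
def interp_quantile(sorted_vals, q):
    # Linear interpolation between the two nearest order statistics,
    # matching pandas' default interpolation='linear'.
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def rolling_quantile(values, window, q):
    # Quantile of each complete rolling window.
    out = []
    for i in range(window - 1, len(values)):
        w = sorted(values[i - window + 1 : i + 1])
        out.append(interp_quantile(w, q))
    return out

result = rolling_quantile([1.0, 3.0, 2.0, 8.0], window=3, q=0.5)
```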
[jira] [Created] (SPARK-40340) Implement `Expanding.sem`.
Haejoon Lee created SPARK-40340: --- Summary: Implement `Expanding.sem`. Key: SPARK-40340 URL: https://issues.apache.org/jira/browse/SPARK-40340 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `Expanding.sem` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.sem.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40341) Implement `Rolling.median`.
Haejoon Lee created SPARK-40341: --- Summary: Implement `Rolling.median`. Key: SPARK-40341 URL: https://issues.apache.org/jira/browse/SPARK-40341 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `Rolling.median` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40338) Implement `Expanding.median`.
[ https://issues.apache.org/jira/browse/SPARK-40338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40338: Description: We should implement `Expanding.median` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.median.html > Implement `Expanding.median`. > - > > Key: SPARK-40338 > URL: https://issues.apache.org/jira/browse/SPARK-40338 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `Expanding.median` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.median.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
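An expanding window runs from the first row up to and including the current row, so `Expanding.median` is a running median. A pure-Python sketch of the expected semantics (illustrative names, not the pandas-on-Spark API):

```python
import statistics

def expanding_median(values):
    # Median of values[0..i] for every position i.
    return [statistics.median(values[: i + 1]) for i in range(len(values))]

result = expanding_median([5.0, 1.0, 3.0])
```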
[jira] [Created] (SPARK-40339) Implement `Expanding.quantile`.
Haejoon Lee created SPARK-40339: --- Summary: Implement `Expanding.quantile`. Key: SPARK-40339 URL: https://issues.apache.org/jira/browse/SPARK-40339 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `Expanding.quantile` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40337) Implement `SeriesGroupBy.describe`.
Haejoon Lee created SPARK-40337: --- Summary: Implement `SeriesGroupBy.describe`. Key: SPARK-40337 URL: https://issues.apache.org/jira/browse/SPARK-40337 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `SeriesGroupBy.describe` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.describe.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
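pandas' `describe()` reports count, mean, std, min, the 25/50/75% quantiles, and max per group. A pure-Python sketch of a subset of those statistics grouped by key (names are illustrative, not the real API):

```python
from collections import defaultdict

def series_groupby_describe(keys, values):
    # count/mean/min/max per group -- a subset of describe()'s eight stats.
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {
        k: {"count": len(vs), "mean": sum(vs) / len(vs),
            "min": min(vs), "max": max(vs)}
        for k, vs in groups.items()
    }

stats = series_groupby_describe(["a", "a", "b"], [1.0, 3.0, 10.0])
```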
[jira] [Updated] (SPARK-40338) Implement `Expanding.median`.
[ https://issues.apache.org/jira/browse/SPARK-40338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40338: Summary: Implement `Expanding.median`. (was: Imple) > Implement `Expanding.median`. > - > > Key: SPARK-40338 > URL: https://issues.apache.org/jira/browse/SPARK-40338 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40338) Imple
Haejoon Lee created SPARK-40338: --- Summary: Imple Key: SPARK-40338 URL: https://issues.apache.org/jira/browse/SPARK-40338 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40336) Implement `DataFrameGroupBy.cov`.
[ https://issues.apache.org/jira/browse/SPARK-40336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40336: Summary: Implement `DataFrameGroupBy.cov`. (was: Implement `DataFrame.cov`.) > Implement `DataFrameGroupBy.cov`. > - > > Key: SPARK-40336 > URL: https://issues.apache.org/jira/browse/SPARK-40336 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `DataFrameGroupBy.cov` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cov.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40336) Implement `DataFrame.cov`.
Haejoon Lee created SPARK-40336: --- Summary: Implement `DataFrame.cov`. Key: SPARK-40336 URL: https://issues.apache.org/jira/browse/SPARK-40336 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `DataFrameGroupBy.cov` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cov.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40292) arrays_zip output unexpected alias column names
[ https://issues.apache.org/jira/browse/SPARK-40292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600198#comment-17600198 ] Ivan Sadikov commented on SPARK-40292:
--
I will take a look.
> arrays_zip output unexpected alias column names
> ---
>
> Key: SPARK-40292
> URL: https://issues.apache.org/jira/browse/SPARK-40292
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Linhong Liu
> Priority: Major
>
> For the below query:
> {code:sql}
> with q as (
>   select
>     named_struct(
>       'my_array', array(named_struct('x', 1, 'y', 2))
>     ) as my_struct
> )
> select
>   arrays_zip(my_struct.my_array)
> from
>   q {code}
> The latest spark gives the below schema; the field name "my_array" was
> changed to "0":
> {code:java}
> root
> |-- arrays_zip(my_struct.my_array): array (nullable = true)
> | |-- element: struct (containsNull = false)
> | | |-- 0: struct (nullable = true)
> | | | |-- x: integer (nullable = true)
> | | | |-- y: integer (nullable = true){code}
> While Spark 3.1 gives the expected result:
> {code:java}
> root
> |-- arrays_zip(my_struct.my_array): array (nullable = true)
> | |-- element: struct (containsNull = false)
> | | |-- my_array: struct (nullable = true)
> | | | |-- x: integer (nullable = true)
> | | | |-- y: integer (nullable = true)
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40335) Implement `DataFrameGroupBy.corr`.
Haejoon Lee created SPARK-40335: --- Summary: Implement `DataFrameGroupBy.corr`. Key: SPARK-40335 URL: https://issues.apache.org/jira/browse/SPARK-40335 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `DataFrameGroupBy.corr` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.corr.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
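`DataFrameGroupBy.corr` computes pairwise Pearson correlations within each group. A pure-Python sketch of the semantics for a single column pair (illustrative names, not the real API):

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def groupby_corr(keys, xs, ys):
    # Correlation of columns x and y, computed independently per group.
    groups = defaultdict(lambda: ([], []))
    for k, x, y in zip(keys, xs, ys):
        groups[k][0].append(x)
        groups[k][1].append(y)
    return {k: pearson(gx, gy) for k, (gx, gy) in groups.items()}

corr = groupby_corr(["a", "a", "a", "b", "b"],
                    [1.0, 2.0, 3.0, 1.0, 2.0],
                    [2.0, 4.0, 6.0, 5.0, 3.0])
```

Group "a" is perfectly linear (correlation 1), group "b" perfectly anti-linear (correlation -1).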
[jira] [Resolved] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40286.
--
Target Version/s: (was: 3.2.1)
Resolution: Invalid
> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
> Issue Type: Question
> Components: Documentation
> Affects Versions: 3.2.1
> Reporter: Drew
> Priority: Major
>
> Hello,
> I'm using Spark to [load data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into a Hive table through PySpark, and when I load data from a path in Amazon S3, the original file gets wiped from the directory. The file is found and its data does populate the table. I also tried adding the `LOCAL` clause, but that throws an error when looking for the file. The documentation doesn't explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src"){code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
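For context on the resolution: Hive-style `LOAD DATA INPATH` (without `LOCAL`) moves the source file into the table's directory rather than copying it, which matches the "deleted" source file reported above. The Python below mimics that move semantics on a local filesystem; the paths are invented for the example, and this is a sketch, not Spark code:

```python
import os
import shutil
import tempfile

# Set up a stand-in "source bucket" and "table location".
src_dir = tempfile.mkdtemp()
table_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "kv1.txt")
with open(src, "w") as f:
    f.write("1\tone\n")

# LOAD DATA INPATH effectively relocates the file, like a move:
shutil.move(src, os.path.join(table_dir, "kv1.txt"))

source_gone = not os.path.exists(src)                          # original is gone
in_table = os.path.exists(os.path.join(table_dir, "kv1.txt"))  # now lives in the table dir
```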
[jira] [Created] (SPARK-40334) Implement `GroupBy.prod`.
Haejoon Lee created SPARK-40334: --- Summary: Implement `GroupBy.prod`. Key: SPARK-40334 URL: https://issues.apache.org/jira/browse/SPARK-40334 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `GroupBy.prod` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
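`GroupBy.prod` multiplies the values within each group. A pure-Python sketch of the semantics (illustrative names, not the real API):

```python
import math
from collections import defaultdict

def groupby_prod(keys, values):
    # Product of values per group, like GroupBy.prod.
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: math.prod(vs) for k, vs in groups.items()}

prods = groupby_prod(["a", "a", "b"], [2, 3, 5])
```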
[jira] [Created] (SPARK-40333) Implement `GroupBy.nth`.
Haejoon Lee created SPARK-40333: --- Summary: Implement `GroupBy.nth`. Key: SPARK-40333 URL: https://issues.apache.org/jira/browse/SPARK-40333 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `DataFrame.compare` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40333: Description: We should implement `GroupBy.nth` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html was: We should implement `DataFrame.compare` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html > Implement `GroupBy.nth`. > > > Key: SPARK-40333 > URL: https://issues.apache.org/jira/browse/SPARK-40333 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `GroupBy.nth` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
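`GroupBy.nth` takes the nth row from each group; negative `n` counts from the end, and groups without an nth row are dropped. A pure-Python sketch of those semantics (illustrative names, not the real API):

```python
from collections import defaultdict

def groupby_nth(keys, values, n):
    # nth value per group; groups without an nth value are omitted,
    # mirroring GroupBy.nth dropping groups that are too short.
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        try:
            out[k] = vs[n]
        except IndexError:
            pass  # group has fewer than n+1 rows
    return out

second = groupby_nth(["a", "a", "b"], [10, 20, 30], 1)   # group "b" too short
last = groupby_nth(["a", "a", "b"], [10, 20, 30], -1)    # last row per group
```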
[jira] [Reopened] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-40287: -- > Load Data using Spark by a single partition moves entire dataset under same > location in S3 > -- > > Key: SPARK-40287 > URL: https://issues.apache.org/jira/browse/SPARK-40287 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm experiencing an issue in PySpark when creating a hive table and loading > in the data to the table. So I'm using an Amazon s3 bucket as a data location > and I'm creating a table as parquet and trying to load data into that table > by a single partition, and I'm seeing some weird behavior. When selecting the > data location in s3 of a parquet file to load into my table. All of the data > is moved into the specified location in my create table command including the > partitions I didn't specify in the load data command. For example: > {code:java} > # create a data frame in pyspark with partitions > df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], > ["c1", "c2", "p"]) > # save it to S3 > df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/") > {code} > In the current state S3 should have a new folder `data` with two folders > which contain a parquet file in each partition. 
> > - s3://bucket/data/p=x/ > - part-1.snappy.parquet > - s3://bucket/data/p=y/ > - part-2.snappy.parquet > - part-3.snappy.parquet > > {code:java} > # create new table > spark.sql("create table src (c1 string, c2 int) PARTITIONED BY (p string) > STORED AS parquet LOCATION 's3://bucket/new/'") > # load the saved table data from s3 specifying single partition value x > spark.sql("LOAD DATA INPATH 's3://bucket/data/' INTO TABLE src PARTITION > (p='x')") > spark.sql("select * from src").show() > # output: > # +---+---+---+ > # | c1| c2| p| > # +---+---+---+ > # +---+---+---+ > {code} > After running the `LOAD DATA` command and looking at the table, I'm left with > no data loaded in. Checking S3, the data we saved earlier has been moved > under `s3://bucket/new/`; oddly enough it also brought over the other > partitions along with it. Directory structure listed below: > - s3://bucket/new/ > - p=x/ > - p=x/ > - part-1.snappy.parquet > - p=y/ > - part-2.snappy.parquet > - part-3.snappy.parquet > Is this the intended behavior of loading the data in from a partitioned > parquet file? Is the previous file supposed to be moved/deleted from the source > directory? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40287. -- Resolution: Invalid > Load Data using Spark by a single partition moves entire dataset under same > location in S3 > -- > > Key: SPARK-40287 > URL: https://issues.apache.org/jira/browse/SPARK-40287 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40287. -- Resolution: Not A Problem > Load Data using Spark by a single partition moves entire dataset under same > location in S3 > -- > > Key: SPARK-40287 > URL: https://issues.apache.org/jira/browse/SPARK-40287 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
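The "Not A Problem" resolution matches Hive's LOAD DATA semantics: every file under the given path is *moved* under the target partition directory, keeping its path relative to the input path. A toy pure-Python model of the observed file moves (this is only an illustration of the semantics, not the actual Hive code):

```python
def load_data_inpath(files, inpath, table_location, partition):
    """Toy model of Hive's LOAD DATA INPATH ... PARTITION (p=...).
    Every file under `inpath` is moved under the partition directory,
    keeping its path relative to `inpath` -- which is why the sibling
    partition folder p=y ends up nested under new/p=x/."""
    moved = {}
    for path in files:
        if path.startswith(inpath):
            rel = path[len(inpath):].lstrip("/")
            moved[path] = f"{table_location.rstrip('/')}/p={partition}/{rel}"
    return moved

src = ["s3://bucket/data/p=x/part-1.snappy.parquet",
       "s3://bucket/data/p=y/part-2.snappy.parquet"]
for old, new in load_data_inpath(src, "s3://bucket/data/", "s3://bucket/new/", "x").items():
    print(old, "->", new)
# reproduces the nested layout from the report: new/p=x/p=x/... and new/p=x/p=y/...
```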
[jira] [Commented] (SPARK-40289) The result is strange when casting string to date in ORC reading via Schema Evolution
[ https://issues.apache.org/jira/browse/SPARK-40289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600197#comment-17600197 ] Hyukjin Kwon commented on SPARK-40289: -- Hm, why don't you read it as a string and cast explicitly? I believe this behaviour is inherited from the ORC library itself > The result is strange when casting string to date in ORC reading via Schema > Evolution > - > > Key: SPARK-40289 > URL: https://issues.apache.org/jira/browse/SPARK-40289 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.1.1 > Environment: * Ubuntu 1804 LTS > * Spark 311 >Reporter: Jianbang Xian >Priority: Minor > > I created an ORC file with the code as follows. > {code:java} > val data = Seq( > ("", "2022-01-32"), // pay attention to this, null > ("", "9808-02-30"), // pay attention to this, 9808-02-29 > ("", "2022-06-31"), // pay attention to this, 2022-06-30 > ) > val cols = Seq("str", "date_str") > val df = spark.createDataFrame(data).toDF(cols:_*).repartition(1) > df.printSchema() > df.show(100) > df.write.mode("overwrite").orc("/tmp/orc/data.orc") > {code} > Please note that all three cases are invalid dates. > And I read it via: > {code:java} > scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); > df.show() > +----------+ > |  date_str| > +----------+ > |      null| > |9808-02-29| > |2022-06-30| > +----------+{code} > Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to > `9808-02-29`? > Intuitively, they are all invalid dates, so we should return 3 nulls. Is it a bug or > a feature? > > > *Background* > * I am working on the project: [https://github.com/NVIDIA/spark-rapids] > * I am working on a feature to support reading an ORC file as a > cuDF (CUDA DataFrame), an in-memory data format on the GPU. > * I need to follow the behavior of ORC reading on the CPU. Otherwise, the users > of spark-rapids will find the results strange. > * That is why I want to know why these happened. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
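The three reported outputs are consistent with a day-of-month clamp rather than strict validation: a day above the lexical maximum of 31 becomes null, while a day that is merely too large for its particular month is clamped to that month's last day. The following is only a hypothetical reconstruction of the observed behaviour in stdlib Python, not the actual ORC code path:

```python
import calendar
from datetime import date

def orc_like_parse(s):
    """Hypothetical reconstruction of the observed ORC behaviour: the
    day must be a lexically valid 1..31, but is then clamped to the
    last day of the (possibly shorter) month instead of being rejected."""
    y, m, d = (int(p) for p in s.split("-"))
    if not (1 <= m <= 12 and 1 <= d <= 31):
        return None                      # 2022-01-32 -> null (day > 31)
    return date(y, m, min(d, calendar.monthrange(y, m)[1]))

print(orc_like_parse("2022-01-32"))  # None
print(orc_like_parse("9808-02-30"))  # 9808-02-29 (9808 is a leap year)
print(orc_like_parse("2022-06-31"))  # 2022-06-30
```

All three observed values fall out of this rule, which supports the comment that the behaviour comes from the ORC library's own date normalization rather than from Spark.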
[jira] [Updated] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40332: Description: We should implement `GroupBy.quantile` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html was: We should implement `DataFrame.compare` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40332) Implement `GroupBy.quantile`.
Haejoon Lee created SPARK-40332: --- Summary: Implement `GroupBy.quantile`. Key: SPARK-40332 URL: https://issues.apache.org/jira/browse/SPARK-40332 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `DataFrame.compare` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
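As a reference for the semantics to match, `GroupBy.quantile` computes a per-group quantile with linear interpolation (pandas' default). A pure-Python sketch of those semantics, not the proposed pandas-on-Spark implementation:

```python
def quantile(xs, q):
    """Linearly interpolated q-quantile (pandas' default interpolation)."""
    xs = sorted(xs)
    pos = q * (len(xs) - 1)              # fractional position in the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def groupby_quantile(rows, q):
    """rows: iterable of (key, value); returns {key: q-quantile of values}."""
    groups = {}
    for k, v in rows:
        groups.setdefault(k, []).append(v)
    return {k: quantile(vs, q) for k, vs in groups.items()}

print(groupby_quantile([("a", 1), ("a", 3), ("b", 10)], 0.5))  # {'a': 2.0, 'b': 10}
```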
[jira] [Updated] (SPARK-40298) shuffle data recovery on the reused PVCs no effect
[ https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40298: - Priority: Major (was: Blocker) > shuffle data recovery on the reused PVCs no effect > --- > > Key: SPARK-40298 > URL: https://issues.apache.org/jira/browse/SPARK-40298 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: 1662002808396.jpg, 1662002822097.jpg > > > I use Spark 3.2.2 to test the [ Support shuffle data recovery on the reused > PVCs (SPARK-35593) ] feature. I found that when a shuffle read fails, data is > still read from the source. > It can be confirmed that the PVC has been reused by other pods, and the > index and data information has been sent. > *This is my Spark configuration:* > --conf spark.driver.memory=5G > --conf spark.executor.memory=15G > --conf spark.executor.cores=1 > --conf spark.executor.instances=50 > --conf spark.sql.shuffle.partitions=50 > --conf spark.dynamicAllocation.enabled=false > --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true > --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2 > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false > --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data > --conf > spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO > --conf spark.kubernetes.executor.missingPodDetectDelta=10s > --conf spark.kubernetes.executor.apiPollingInterval=10s > --conf 
spark.shuffle.io.retryWait=60s > --conf spark.shuffle.io.maxRetries=5 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40298) shuffle data recovery on the reused PVCs no effect
[ https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40298. -- Resolution: Invalid > shuffle data recovery on the reused PVCs no effect > --- > > Key: SPARK-40298 > URL: https://issues.apache.org/jira/browse/SPARK-40298 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: 1662002808396.jpg, 1662002822097.jpg > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40298) shuffle data recovery on the reused PVCs no effect
[ https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600196#comment-17600196 ] Hyukjin Kwon commented on SPARK-40298: -- [~todd5167] for questions, it's better to interact on the dev mailing list; you'd get a better answer there. > shuffle data recovery on the reused PVCs no effect > --- > > Key: SPARK-40298 > URL: https://issues.apache.org/jira/browse/SPARK-40298 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: 1662002808396.jpg, 1662002822097.jpg > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40299) java api calls the count() method to appear: java.lang.ArithmeticException: BigInteger would overflow supported range
[ https://issues.apache.org/jira/browse/SPARK-40299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40299. -- Resolution: Cannot Reproduce > java api calls the count() method to appear: java.lang.ArithmeticException: > BigInteger would overflow supported range > - > > Key: SPARK-40299 > URL: https://issues.apache.org/jira/browse/SPARK-40299 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.2 >Reporter: code1v5 >Priority: Major > > ive Session ID = a372ea31-ac98-4e01-9de3-dfb623df87a4 > 22/09/01 13:50:32 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > [Stage 0:> (0 + 8) / > 8]22/09/01 13:50:41 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, > hdp3-10-106, executor 6): java.lang.ArithmeticException: BigInteger would > overflow supported range > at java.math.BigInteger.reportOverflow(BigInteger.java:1084) > at java.math.BigInteger.pow(BigInteger.java:2391) > at java.math.BigDecimal.bigTenToThe(BigDecimal.java:3574) > at java.math.BigDecimal.bigMultiplyPowerTen(BigDecimal.java:3707) > at java.math.BigDecimal.setScale(BigDecimal.java:2448) > at java.math.BigDecimal.setScale(BigDecimal.java:2515) > at > org.apache.hadoop.hive.common.type.HiveDecimal.trim(HiveDecimal.java:241) > at > org.apache.hadoop.hive.common.type.HiveDecimal.normalize(HiveDecimal.java:252) > at > org.apache.hadoop.hive.common.type.HiveDecimal.create(HiveDecimal.java:83) > at > org.apache.hadoop.hive.serde2.lazy.LazyHiveDecimal.init(LazyHiveDecimal.java:79) > at > org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:226) > at > org.apache.hadoop.hive.serde2.lazy.LazyStruct.getField(LazyStruct.java:202) > at > org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:128) > at > 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:439) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:434) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 22/09/01 13:50:42 ERROR TaskSetManager: Task 5 in stage 0.0 failed 4 times; > aborting job > 22/09/01 13:50:42 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, > hdp2-10-105, executor 8): TaskKilled (Stage cancelled) > [Stage 0:> (0 + 6) / > 8]org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 > in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 > (TID 10, hdp3-10-106, executor 6): java.lang.ArithmeticException: BigInteger > 
would overflow supported range > at java.math.BigInteger.reportOverflow(BigInteger.java:1084) > at java.math.BigInteger.pow(BigInteger.java:2391) > at java.math.BigDecimal.bigTenToThe(BigDecimal.java:3574) > at java.math.BigDecimal.bigMultiplyPowerTen(BigDecimal.java:3707) > at java.math.BigDecimal.setScale(BigDecimal.java:2448) > at java.math.BigDecimal.setScale(BigDecimal.java:2515) > at > org.apache.hadoop.hive.common.type.HiveDecimal.trim(HiveDecimal.java:241) > at >
[jira] [Updated] (SPARK-40317) Improvement to JDBC predicate for queries involving joins
[ https://issues.apache.org/jira/browse/SPARK-40317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40317: - Component/s: SQL (was: Spark Core) > Improvement to JDBC predicate for queries involving joins > - > > Key: SPARK-40317 > URL: https://issues.apache.org/jira/browse/SPARK-40317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: David Ahern >Priority: Major > > Current behaviour on queries involving joins seems to be to use a subquery as follows > > select * from > ( > select a, b, c from tbl1 > lj tbl2 on tbl1.col1 = tbl2.col1 > lj tbl3 on tbl1.col2 = tbl3.col2 > ) > where predicate = 1 > where predicate = 2 > where predicate = 3 > > More desirable would be > ( > select a, b, c from tbl1 where (predicate = 1, predicate = 2, etc) > lj tbl2 on tbl1.col1 = tbl2.col1 > lj tbl3 on tbl1.col2 = tbl3.col2 > ) > > i.e. do the join on the subset of data rather than joining all the data and then > filtering. Predicate pushdown usually only works on columns that have been > indexed, so even if the data isn't indexed, this would reduce the amount of data > that needs to be moved. In many cases it's better to do the join on the DB side than > to pull everything into Spark. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
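Until Spark generates such SQL itself, one workaround is to hand the pre-filtered join to the database as the JDBC `dbtable` subquery, so the predicates are applied before the join runs. A sketch that builds the desired query string (table and column names taken from the example above; the surrounding Spark read is shown as a comment only):

```python
def pushdown_table(predicates):
    """Build a JDBC `dbtable` subquery that filters tbl1 *before* the
    joins, instead of letting Spark wrap the whole join and filter last."""
    where = " AND ".join(predicates)
    return (
        "(SELECT a, b, c FROM "
        f"(SELECT * FROM tbl1 WHERE {where}) t1 "
        "LEFT JOIN tbl2 ON t1.col1 = tbl2.col1 "
        "LEFT JOIN tbl3 ON t1.col2 = tbl3.col2) q"
    )

q = pushdown_table(["predicate = 1", "predicate = 2", "predicate = 3"])
# usage (not run here):
# spark.read.format("jdbc").option("url", url).option("dbtable", q).load()
print(q)
```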
[jira] [Created] (SPARK-40331) Java 11 should be used as the recommended running environment
Yang Jie created SPARK-40331: Summary: Java 11 should be used as the recommended running environment Key: SPARK-40331 URL: https://issues.apache.org/jira/browse/SPARK-40331 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.4.0 Reporter: Yang Jie Cases similar to those described in SPARK-40303 will not have negative effects if Java 11+ is used as the runtime -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit
[ https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600193#comment-17600193 ] Apache Spark commented on SPARK-40271: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37798 > Support list type for pyspark.sql.functions.lit > --- > > Key: SPARK-40271 > URL: https://issues.apache.org/jira/browse/SPARK-40271 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Currently, `pyspark.sql.functions.lit` doesn't support for Python list type > as below: > {code:python} > >>> df = spark.range(3).withColumn("c", lit([1,2,3])) > Traceback (most recent call last): > ... > : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] > The feature is not supported: Literal for '[1, 2, 3]' of class > java.util.ArrayList. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302) > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100) > at org.apache.spark.sql.functions$.lit(functions.scala:125) > at org.apache.spark.sql.functions.lit(functions.scala) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:577) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.base/java.lang.Thread.run(Thread.java:833) > {code} > We 
should support it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40330) Implement `Series.searchsorted`.
Haejoon Lee created SPARK-40330: --- Summary: Implement `Series.searchsorted`. Key: SPARK-40330 URL: https://issues.apache.org/jira/browse/SPARK-40330 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `Series.searchsorted` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.Series.searchsorted.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
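`Series.searchsorted` returns the index at which a value would be inserted into a sorted series to keep it sorted; Python's stdlib `bisect` module expresses the same semantics. A pure-Python sketch for reference, not the pandas-on-Spark implementation:

```python
from bisect import bisect_left, bisect_right

def searchsorted(sorted_vals, x, side="left"):
    """Pure-Python equivalent of Series.searchsorted for a single value:
    the index where `x` would be inserted to keep `sorted_vals` sorted.
    side='left' gives the first valid position, side='right' the last."""
    return (bisect_left if side == "left" else bisect_right)(sorted_vals, x)

print(searchsorted([1, 2, 3], 2))            # 1
print(searchsorted([1, 2, 3], 2, "right"))   # 2
```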
[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40327: Component/s: Pandas API on Spark (was: ps) > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40328) Implement `DataFrame.compare`.
[ https://issues.apache.org/jira/browse/SPARK-40328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40328: Component/s: Pandas API on Spark (was: ps) > Implement `DataFrame.compare`. > -- > > Key: SPARK-40328 > URL: https://issues.apache.org/jira/browse/SPARK-40328 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `DataFrame.compare` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40329) Implement `Series.compare`.
Haejoon Lee created SPARK-40329: --- Summary: Implement `Series.compare`. Key: SPARK-40329 URL: https://issues.apache.org/jira/browse/SPARK-40329 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement `Series.compare` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.Series.compare.html#pandas.Series.compare -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
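For reference, `Series.compare` aligns two series and keeps only the positions where they differ, showing both sides. A pure-Python sketch of those semantics (ignoring pandas' special-casing of NaN equality and the `keep_shape`/`keep_equal` options):

```python
def series_compare(self_vals, other_vals):
    """Sketch of Series.compare semantics: keep only the positions where
    the two aligned series differ, reporting both sides."""
    return {
        i: {"self": a, "other": b}
        for i, (a, b) in enumerate(zip(self_vals, other_vals))
        if a != b
    }

print(series_compare(["a", "b", "c"], ["a", "x", "c"]))
# {1: {'self': 'b', 'other': 'x'}}
```

`DataFrame.compare` (SPARK-40328) generalizes the same idea to differing cells across aligned columns.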
[jira] [Updated] (SPARK-40328) Implement `DataFrame.compare`.
[ https://issues.apache.org/jira/browse/SPARK-40328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40328: Description: We should implement `DataFrame.compare` for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html. was: We should implement DataFrame.compare for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html. > Implement `DataFrame.compare`. > -- > > Key: SPARK-40328 > URL: https://issues.apache.org/jira/browse/SPARK-40328 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should implement `DataFrame.compare` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40328) Implement `DataFrame.compare`.
Haejoon Lee created SPARK-40328: --- Summary: Implement `DataFrame.compare`. Key: SPARK-40328 URL: https://issues.apache.org/jira/browse/SPARK-40328 Project: Spark Issue Type: Sub-task Components: ps Affects Versions: 3.4.0 Reporter: Haejoon Lee We should implement DataFrame.compare for increasing pandas API coverage. pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40327: Description: Increasing the pandas API coverage for Apache Spark 3.4.0. (was: Increasing the pandas API coverage for Apache Spark 3.4.0, as we did for Apache Spark 3.3.0 in https://issues.apache.org/jira/browse/SPARK-36394.) > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: ps >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0.
[jira] [Created] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
Haejoon Lee created SPARK-40327: --- Summary: Increase pandas API coverage for pandas API on Spark Key: SPARK-40327 URL: https://issues.apache.org/jira/browse/SPARK-40327 Project: Spark Issue Type: Umbrella Components: ps Affects Versions: 3.4.0 Reporter: Haejoon Lee Increasing the pandas API coverage for Apache Spark 3.4.0, as we did for Apache Spark 3.3.0 in https://issues.apache.org/jira/browse/SPARK-36394.
[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40149: - Priority: Blocker (was: Major)
> Star expansion after outer join asymmetrically includes joining key
> -------------------------------------------------------------------
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Reporter: Otakar Truněček
> Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the joining key, while on the right side of the join it doesn't. I would expect the behaviour to be symmetric (either include the key on both sides or on neither). Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
>     df_left
>     .alias('left')
>     .join(df_right.alias('right'), on='id', how='full_outer')
>     .withColumn('left_all', f.struct('left.*'))
>     .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not included on either side.
> Result from Spark 3.1.3:
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+
> {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
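The asymmetry reported above can be pictured without a Spark cluster as a plain-Python toy model. The `full_outer_join` helper below is purely illustrative (not a Spark or pandas API); it shows the symmetric semantics the report asks for, with the join key excluded from both structs as in the Spark 3.1.3 output:

```python
def full_outer_join(left, right, key):
    """Toy full outer join over lists of dicts; unmatched sides yield None values."""
    keys = {row[key] for row in left} | {row[key] for row in right}
    left_by = {row[key]: row for row in left}
    right_by = {row[key]: row for row in right}
    out = []
    for k in sorted(keys):
        l = left_by.get(k, {key: None, "val": None})
        r = right_by.get(k, {key: None, "val": None})
        # Symmetric star expansion: the join key is dropped from BOTH structs,
        # matching the pre-3.2.0 behaviour shown in the Spark 3.1.3 result.
        out.append({key: k,
                    "left_all": {"val": l["val"]},
                    "right_all": {"val": r["val"]}})
    return out

# Same data as the pyspark example: ids 0-4 on the left, 3-6 on the right.
left = [{"id": i, "val": "left"} for i in range(5)]
right = [{"id": i, "val": "right"} for i in range(3, 7)]
rows = full_outer_join(left, right, "id")
```

Under these semantics `left_all` and `right_all` are shaped identically, which is the symmetry the report expects.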
[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40149: - Target Version/s: 3.4.0
[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40149: - Target Version/s: (was: 3.4.0)
[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40149: - Priority: Major (was: Blocker)
[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600111#comment-17600111 ] Apache Spark commented on SPARK-40142: -- User 'khalidmammadov' has created a pull request for this issue: https://github.com/apache/spark/pull/37797 > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40326: Assignee: (was: Apache Spark) > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]
[jira] [Assigned] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40326: Assignee: Apache Spark > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]
[jira] [Commented] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600084#comment-17600084 ] Apache Spark commented on SPARK-40326: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/37796 > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]
[jira] [Created] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
Bjørn Jørgensen created SPARK-40326: --- Summary: upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4 Key: SPARK-40326 URL: https://issues.apache.org/jira/browse/SPARK-40326 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.4.0 Reporter: Bjørn Jørgensen [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]
[jira] [Assigned] (SPARK-40321) Upgrade rocksdbjni to 7.5.3
[ https://issues.apache.org/jira/browse/SPARK-40321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40321: Assignee: Yang Jie > Upgrade rocksdbjni to 7.5.3 > --- > > Key: SPARK-40321 > URL: https://issues.apache.org/jira/browse/SPARK-40321 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/facebook/rocksdb/releases
[jira] [Resolved] (SPARK-40321) Upgrade rocksdbjni to 7.5.3
[ https://issues.apache.org/jira/browse/SPARK-40321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40321. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37783 [https://github.com/apache/spark/pull/37783] > Upgrade rocksdbjni to 7.5.3 > --- > > Key: SPARK-40321 > URL: https://issues.apache.org/jira/browse/SPARK-40321 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/facebook/rocksdb/releases
[jira] [Assigned] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-39996: Assignee: Bjørn Jørgensen > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] >
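The vulnerability class fixed by this pgjdbc release, identifiers concatenated verbatim into generated SQL, can be illustrated with a minimal quoting helper. This is a sketch of the general PostgreSQL identifier-quoting rule (wrap in double quotes, double any embedded quotes), not pgjdbc's actual code:

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier PostgreSQL-style: wrap it in double quotes and
    double any embedded double quotes, so the name cannot terminate the
    identifier early or smuggle in extra statements."""
    return '"' + name.replace('"', '""') + '"'

# A malicious column name containing a statement terminator...
evil = 'x"; DROP TABLE users; --'
# ...stays one (strange) identifier once quoted, instead of splitting
# the generated statement into multiple commands:
sql = "SELECT " + quote_ident(evil) + " FROM t"
```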
[jira] [Resolved] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39996. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37762 [https://github.com/apache/spark/pull/37762] > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] >
[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-39996: - Component/s: Tests > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build, Tests >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0 > > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] >
[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-39996: - Priority: Minor (was: Major) > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0 > > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] >
[jira] [Created] (SPARK-40325) Support of Columnar result(ColumnarBatch) in org.apache.spark.sql.Dataset flatMap, transform, etc
Igor Suhorukov created SPARK-40325: -- Summary: Support of Columnar result(ColumnarBatch) in org.apache.spark.sql.Dataset flatMap, transform, etc Key: SPARK-40325 URL: https://issues.apache.org/jira/browse/SPARK-40325 Project: Spark Issue Type: New Feature Components: Java API, Spark Core Affects Versions: 3.3.0 Reporter: Igor Suhorukov Sometimes the result of a data transformation in a JVM program is available from native code in the Apache Arrow columnar data format. The current Dataset API requires an unnecessary transformation from the columnar format wrapper into rows, with additional allocations on the JVM heap. This proposed feature asks for propagation of columnar data through the Dataset API without the unnecessary InternalRow->Row->InternalRow conversion. The current solution uses a [ColumnarBatch wrapper|https://github.com/igor-suhorukov/spark3/blob/master/src/main/java/com/github/igorsuhorukov/arrow/spark/ArrowDataIterator.java] on top of ArrowColumnVector and rowExpressionEncoder.createDeserializer() to transform data [into Row|https://github.com/igor-suhorukov/spark3/blob/c655d4b6058fdd4529aa59093edfe2333d96fb05/src/main/java/com/github/igorsuhorukov/arrow/spark/ArrowDataIterator.java#L53]
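The conversion overhead described above can be pictured with a toy model: a "columnar batch" held as one list per column, and the per-row materialization that row-oriented APIs force. This is plain Python for illustration only, not Spark's ColumnarBatch or Arrow APIs:

```python
# Toy columnar batch: one list per column, as an Arrow-style layout would hold it.
batch = {"id": [1, 2, 3], "name": ["a", "b", "c"]}

def to_rows(batch):
    """Columnar -> row-oriented: allocates one dict per row.
    This per-row allocation is the overhead the issue wants to avoid."""
    cols = list(batch)
    return [dict(zip(cols, vals)) for vals in zip(*batch.values())]

def to_columns(rows):
    """Row-oriented -> columnar: rebuilds the per-column lists."""
    cols = list(rows[0])
    return {c: [r[c] for r in rows] for c in cols}

rows = to_rows(batch)
round_tripped = to_columns(rows)
```

The round trip preserves the data but allocates a container per row in between; keeping the columnar form end-to-end would skip both conversions.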
[jira] [Assigned] (SPARK-40324) Provide a query context of ParseException
[ https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40324: Assignee: Apache Spark (was: Max Gekk) > Provide a query context of ParseException > - > > Key: SPARK-40324 > URL: https://issues.apache.org/jira/browse/SPARK-40324 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Extends the exception ParseException and add a queryContext into it.
[jira] [Assigned] (SPARK-40324) Provide a query context of ParseException
[ https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40324: Assignee: Max Gekk (was: Apache Spark) > Provide a query context of ParseException > - > > Key: SPARK-40324 > URL: https://issues.apache.org/jira/browse/SPARK-40324 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Extends the exception ParseException and add a queryContext into it.
[jira] [Commented] (SPARK-40324) Provide a query context of ParseException
[ https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600047#comment-17600047 ] Apache Spark commented on SPARK-40324: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37794 > Provide a query context of ParseException > - > > Key: SPARK-40324 > URL: https://issues.apache.org/jira/browse/SPARK-40324 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Extends the exception ParseException and add a queryContext into it.
[jira] [Created] (SPARK-40324) Provide a query context of ParseException
Max Gekk created SPARK-40324: Summary: Provide a query context of ParseException Key: SPARK-40324 URL: https://issues.apache.org/jira/browse/SPARK-40324 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Extend the exception ParseException and add a queryContext to it.
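As a rough sketch of what a query context buys the user here — the offending fragment, its offsets, and a caret marker rendered under the query — assuming a hypothetical exception class (not Spark's actual ParseException):

```python
class ParseErrorWithContext(Exception):
    """Toy parse error carrying a query context: the offending fragment,
    its start/stop offsets, and a caret-marked rendering of the query."""
    def __init__(self, message, query, start, stop):
        self.fragment = query[start:stop]
        self.start, self.stop = start, stop
        marker = " " * start + "^" * (stop - start)
        super().__init__(message + "\n== SQL ==\n" + query + "\n" + marker)

# The fragment 'FRM' spans offsets 9..12 of this query string.
query = "SELECT * FRM t"
err = ParseErrorWithContext("Syntax error at or near 'FRM'", query, 9, 12)
```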
[jira] [Commented] (SPARK-40251) Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2
[ https://issues.apache.org/jira/browse/SPARK-40251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600035#comment-17600035 ] Apache Spark commented on SPARK-40251: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37791 > Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2 > -- > > Key: SPARK-40251 > URL: https://issues.apache.org/jira/browse/SPARK-40251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/luhenry/netlib/compare/v2.2.1...v3.0.2
[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600032#comment-17600032 ] Apache Spark commented on SPARK-40288: -- User 'hgs19921112' has created a pull request for this issue: https://github.com/apache/spark/pull/37790
> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should be applied to avoid missing attributes when a complex expression is used.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
> Reporter: hgs
> Priority: Minor
>
> --table
> {code:sql}
> create table miss_expr(id int, name string, age double) stored as textfile
> {code}
> --data
> {code:sql}
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> {code}
> --failure sql
> {code:sql}
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> {code}
> --error stack
> {code:java}
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
> {code}
[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should be applied to avoid missing attributes when complex expressions are used.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600029#comment-17600029 ] Apache Spark commented on SPARK-40288: -- User 'hgs19921112' has created a pull request for this issue: https://github.com/apache/spark/pull/37788
> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should be
> applied to avoid missing attributes when complex expressions are used.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0, 3.3.0
> Environment: Spark 3.2.0, 3.2.2, 3.3.0
> Reporter: hgs
> Priority: Minor
>
> {code:sql}
> -- table
> create table miss_expr(id int, name string, age double) stored as textfile
>
> -- data
> insert overwrite table miss_expr
> values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
>
> -- failing sql
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage
>   from miss_expr group by id, name, age
> ) group by id, name, nage
> {code}
> Error stack:
> {code}
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in
> [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) -
[jira] [Commented] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups
[ https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600021#comment-17600021 ] Augustine Theodore Prince commented on SPARK-39796: --- Hi [~planga82] ,
{code:java}
df.withColumn("g1", regexp_extract('a, regex, 1)).withColumn("g2", regexp_extract('a, regex, 2)).show {code}
In the statement above, the regular expression is compiled and processed twice. Wouldn't it be more performant to compile it once and extract all the groups, so that the expression is processed only once?
> Add a regexp_extract variant which returns an array of all the matched
> capture groups
> -
>
> Key: SPARK-39796
> URL: https://issues.apache.org/jira/browse/SPARK-39796
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.2
> Reporter: Augustine Theodore Prince
> Priority: Minor
> Labels: regexp_extract, regexp_extract_all, regexp_replace
>
> regexp_extract only returns a single matched group. In many cases we need
> to parse the entire string and get all the groups, which requires calling it
> as many times as there are groups. The regexp_extract_all function doesn't
> solve this problem, because it only works when all the groups share the same
> regex pattern.
>
> _Example:_
> Here is an example together with the current workaround I use.
> Suppose I have the following dataframe and want to match column 'a' against
> this pattern:
> {code:java}
> "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
> ||a||
> |Hello, World, 1234|
> |Good, bye, friend|
>
> My expected output is as follows:
> ||a||extracted_a||
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|[]|
>
> However, to achieve this I have to take the following approach, which seems
> very hackish.
> 1. Use regexp_replace to create a temporary string built from the extracted
> groups:
> {code:java}
> df = df.withColumn("extracted_a", F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, (\\d+)", "$1_$2")){code}
> A side effect of regexp_replace is that if the regex fails to match, the
> entire string is returned:
> ||a||extracted_a||
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|Good, bye, friend|
> 2. So, to achieve the desired result, a check has to be done to prune the
> rows that did not match the pattern:
> {code:java}
> df = df.withColumn("extracted_a", F.when(F.col("extracted_a") == F.col("a"), None).otherwise(F.col("extracted_a"))){code}
> which gives the following intermediate dataframe:
> ||a||extracted_a||
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|null|
>
> 3. Before finally splitting the column 'extracted_a' on underscores:
> {code:java}
> df = df.withColumn("extracted_a", F.split("extracted_a", "[_]")){code}
> which produces the desired result:
> ||a||extracted_a||
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|null|
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
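The workaround above needs three passes (regexp_replace, when, split) to emulate one extraction. Purely as a sketch of the requested semantics — the helper name `extract_all_groups` is hypothetical, and this is plain Python `re`, not Spark SQL — a variant that compiles the pattern once and returns every capture group, or an empty array when there is no match, could behave like this:

```python
import re

def extract_all_groups(pattern: str, s: str):
    """Return all capture groups of the first match of `pattern` in `s`,
    or an empty list when the pattern does not match."""
    compiled = re.compile(pattern)  # compile once, reuse for every input row
    m = compiled.search(s)
    return list(m.groups()) if m else []

pattern = r"([A-Za-z]+), [A-Za-z]+, (\d+)"
extract_all_groups(pattern, "Hello, World, 1234")  # -> ["Hello", "1234"]
extract_all_groups(pattern, "Good, bye, friend")   # -> [] (no digits, no match)
```

Exposed as a Spark SQL expression, something with these semantics would replace the regexp_replace/when/split chain above with a single call and a single pass over the regex.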
[jira] [Assigned] (SPARK-40301) Add parameter validation in pyspark.rdd
[ https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40301: - Assignee: Ruifeng Zheng > Add parameter validation in pyspark.rdd > --- > > Key: SPARK-40301 > URL: https://issues.apache.org/jira/browse/SPARK-40301 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40301) Add parameter validation in pyspark.rdd
[ https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-40301. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37752 [https://github.com/apache/spark/pull/37752] > Add parameter validation in pyspark.rdd > --- > > Key: SPARK-40301 > URL: https://issues.apache.org/jira/browse/SPARK-40301 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
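The notices above only record that parameter validation was added to pyspark.rdd, not what it looks like. As a purely illustrative sketch — the function name and checks below are hypothetical, not taken from PR 37752 — validating user-supplied parameters in an RDD-style API typically means type- and range-checking arguments on the driver before any distributed work starts:

```python
def take_sample(fraction, seed=None):
    """Hypothetical RDD-style method: validate arguments up front so a bad
    call fails fast with a clear error instead of surfacing on an executor."""
    if not isinstance(fraction, (int, float)):
        raise TypeError(f"fraction must be a number, got {type(fraction).__name__}")
    if fraction < 0.0:
        raise ValueError(f"fraction must be nonnegative, got {fraction}")
    if seed is not None and not isinstance(seed, int):
        raise TypeError(f"seed must be an int or None, got {type(seed).__name__}")
    return fraction, seed  # a real method would go on to sample the RDD
```

Failing fast with a precise TypeError/ValueError on the driver is much easier to debug than letting a malformed argument propagate into a task failure.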
[jira] [Assigned] (SPARK-40305) Implement Groupby.sem
[ https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40305: - Assignee: Ruifeng Zheng > Implement Groupby.sem > - > > Key: SPARK-40305 > URL: https://issues.apache.org/jira/browse/SPARK-40305 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40305) Implement Groupby.sem
[ https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-40305. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37756 [https://github.com/apache/spark/pull/37756] > Implement Groupby.sem > - > > Key: SPARK-40305 > URL: https://issues.apache.org/jira/browse/SPARK-40305 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
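For context on what `GroupBy.sem` computes: the standard error of the mean per group, i.e. the sample standard deviation divided by the square root of the group size (pandas defaults to ddof=1). A minimal plain-Python sketch of those semantics — assuming nothing about the actual pandas-on-Spark implementation — is:

```python
import math
from collections import defaultdict
from statistics import stdev  # sample standard deviation (ddof=1)

def groupby_sem(rows, key, value):
    """Group `rows` (dicts) by `key` and return {group: sem of `value`},
    where sem = stdev / sqrt(n). Groups of size 1 are skipped (sem undefined)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: stdev(v) / math.sqrt(len(v)) for k, v in groups.items() if len(v) > 1}

rows = [
    {"id": 1, "age": 1.0}, {"id": 1, "age": 2.0},
    {"id": 2, "age": 3.0}, {"id": 2, "age": 4.0},
]
groupby_sem(rows, "id", "age")  # sem is 0.5 for both groups (up to float rounding)
```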