[jira] [Updated] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-04 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-40320:
-
Description: 
*Reproduction steps:*
Set `spark.plugins=ErrorSparkPlugin`.
The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below (the code is 
abbreviated to make it clearer):
{code:java}
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

class ErrorSparkPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()

  override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
}{code}
{code:java}
import java.util

import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}

class ErrorExecutorPlugin extends ExecutorPlugin {
  private val checkingInterval: Long = 1

  override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
    // Deliberately throw a fatal (non-Exception) error during executor plugin init.
    if (checkingInterval == 1) {
      throw new UnsatisfiedLinkError("My Exception error")
    }
  }
} {code}
The Executor shows as active when we check the Spark UI; however, it is broken and 
doesn't receive any task.

*Root Cause:*

I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
rethrows fatal errors (`UnsatisfiedLinkError` is a fatal error here) in the method 
`dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process is actually 
still alive, but the communication thread is no longer working (see 
`MessageLoop#receiveLoopRunnable`: `receiveLoop()` has terminated, so the executor 
doesn't receive any messages).

Some ideas:
I think it is very hard to know what happened here unless we check the code. The 
Executor is active but can't do anything, and we are left wondering whether the 
driver is broken or the Executor has a problem. I think at least the Executor 
status shouldn't be shown as active here, or the Executor should call exitExecutor 
(kill itself).
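
As a rough illustration of the last idea, a plugin author can at least fail fast 
today by guarding their own init. This is only a sketch (the class name and the 
handling are hypothetical, not the proposed Spark-side fix):
{code:java}
import java.util.{Map => JMap}

import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}

// Hypothetical defensive wrapper: if init fails with a fatal error, terminate
// the JVM so the scheduler notices the executor is gone, instead of leaving an
// "active" executor whose message loop is dead.
class FailFastExecutorPlugin extends ExecutorPlugin {
  override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
    try {
      // real initialization work goes here
    } catch {
      case t: Throwable =>
        System.err.println(s"Executor plugin init failed, exiting: $t")
        System.exit(1)
    }
  }
}{code}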

 

  was:
*Reproduction steps:*
Set `spark.plugins=ErrorSparkPlugin`.
The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below (the code is 
abbreviated to make it clearer):
{code:java}
class ErrorSparkPlugin extends SparkPlugin {
  /**
   */
  override def driverPlugin(): DriverPlugin =  new ErrorDriverPlugin()

  /**
   */
  override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
}{code}
{code:java}
class ErrorExecutorPlugin extends ExecutorPlugin {
  private val checkingInterval: Long = 1

  override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): 
Unit = {
if (checkingInterval == 1) {
  throw new UnsatisfiedLinkError("My Exception error")
}
  }
} {code}
The Executor shows as active when we check the Spark UI; however, it is broken and 
doesn't receive any task.

*Root Cause:*

I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
rethrows fatal errors (`UnsatisfiedLinkError` is a fatal error here) in the method 
`dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process is actually 
still alive, but the communication thread is no longer working (see 
`MessageLoop#receiveLoopRunnable`: `receiveLoop()` has terminated, so the executor 
doesn't receive any messages).

Some ideas:
I think it is very hard to know what happened here unless we check the code. The 
Executor is active but can't do anything, and we are left wondering whether the 
driver is broken or the Executor has a problem. I think at least the Executor 
status shouldn't be shown as active here, or the Executor should call exitExecutor 
(kill itself).

 


> When the Executor plugin fails to initialize, the Executor shows active but 
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Mars
>Priority: Major
>
> *Reproduction steps:*
> Set `spark.plugins=ErrorSparkPlugin`.
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below (the code 
> is abbreviated to make it clearer):
> {code:java}
> class ErrorSparkPlugin extends SparkPlugin {
>   /**
>*/
>   override def driverPlugin(): DriverPlugin =  new ErrorDriverPlugin()
>   /**
>*/
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:java}
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, 
> String]): Unit = {
> if (checkingInterval == 1) {
>   throw new UnsatisfiedLinkError("My Exception error")
> }
>   }
> } {code}
> The Executor shows as active when we check the Spark UI; however, it is broken and 
> doesn't receive any task.
> *Root Cause:*
> I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
> rethrows fatal errors (`UnsatisfiedLinkError` is a fatal error here) in the method 
> `dealWithFatalError`. Actually the `CoarseGrainedExecutorBackend` JVM 
> 

[jira] [Updated] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-04 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated SPARK-40320:
-
Description: 
*Reproduction steps:*
Set `spark.plugins=ErrorSparkPlugin`.
The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below (the code is 
abbreviated to make it clearer):
{code:java}
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

class ErrorSparkPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()

  override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
}{code}
{code:java}
import java.util

import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}

class ErrorExecutorPlugin extends ExecutorPlugin {
  private val checkingInterval: Long = 1

  override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
    // Deliberately throw a fatal (non-Exception) error during executor plugin init.
    if (checkingInterval == 1) {
      throw new UnsatisfiedLinkError("My Exception error")
    }
  }
} {code}
The Executor shows as active when we check the Spark UI; however, it is broken and 
doesn't receive any task.

*Root Cause:*

I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
rethrows fatal errors (`UnsatisfiedLinkError` is a fatal error here) in the method 
`dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process is actually 
still alive, but the communication thread is no longer working (see 
`MessageLoop#receiveLoopRunnable`: `receiveLoop()` has terminated, so the executor 
doesn't receive any messages).

Some ideas:
I think it is very hard to know what happened here unless we check the code. The 
Executor is active but can't do anything, and we are left wondering whether the 
driver is broken or the Executor has a problem. I think at least the Executor 
status shouldn't be shown as active here, or the Executor should call exitExecutor 
(kill itself).

 

  was:
*Reproduction steps:*
Set `spark.plugins=ErrorSparkPlugin`.
The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes (the code is abbreviated 
to make it clearer):
{code:java}
class ErrorSparkPlugin extends SparkPlugin {
  /**
   */
  override def driverPlugin(): DriverPlugin =  new ErrorDriverPlugin()

  /**
   */
  override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
}{code}
{code:java}
class ErrorExecutorPlugin extends ExecutorPlugin with Logging {
  private val checkingInterval: Long = 1

  override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): 
Unit = {
if (checkingInterval == 1) {
  throw new UnsatisfiedLinkError("LCL my Exception error2")
}
  }
} {code}
The Executor shows as active when we check the Spark UI; however, it is broken and 
doesn't receive any task.

*Root Cause:*

I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
rethrows fatal errors (`UnsatisfiedLinkError` is a fatal error here) in the method 
`dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process is actually 
still alive, but the communication thread is no longer working (see 
`MessageLoop#receiveLoopRunnable`: `receiveLoop()` has terminated, so the executor 
doesn't receive any messages).

Some ideas:
I think it is very hard to know what happened here unless we check the code. The 
Executor is active but can't do anything, and we are left wondering whether the 
driver is broken or the Executor has a problem. I think at least the Executor 
status shouldn't be shown as active here, or the Executor should call exitExecutor 
(kill itself).

 


> When the Executor plugin fails to initialize, the Executor shows active but 
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Mars
>Priority: Major
>
> *Reproduction steps:*
> Set `spark.plugins=ErrorSparkPlugin`.
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below (the code 
> is abbreviated to make it clearer):
> {code:java}
> class ErrorSparkPlugin extends SparkPlugin {
>   /**
>*/
>   override def driverPlugin(): DriverPlugin =  new ErrorDriverPlugin()
>   /**
>*/
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:java}
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, 
> String]): Unit = {
> if (checkingInterval == 1) {
>   throw new UnsatisfiedLinkError("My Exception error")
> }
>   }
> } {code}
> The Executor shows as active when we check the Spark UI; however, it is broken and 
> doesn't receive any task.
> *Root Cause:*
> I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
> rethrows fatal errors (`UnsatisfiedLinkError` is a fatal error here) in the 
> method `dealWithFatalError`. Actually the  

[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-04 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600203#comment-17600203
 ] 

Ruifeng Zheng commented on SPARK-40327:
---

cc [~yikunkero] If you want to give it a try, feel free to take over some of these 
subtasks. Thanks in advance!

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40298) shuffle data recovery on the reused PVCs no effect

2022-09-04 Thread todd (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

todd reopened SPARK-40298:
--

> shuffle data recovery on the reused PVCs  no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
>
> I used Spark 3.2.2 to test the [ Support shuffle data recovery on the reused 
> PVCs (SPARK-35593) ] feature. I found that when a shuffle read fails, the data 
> is still read from the source.
> It can be confirmed that the PVC has been reused by other pods, and the index 
> and data file information has been sent.
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G 
> --conf spark.executor.memory=15G 
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf 
> spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40349) Implement `RollingGroupby.sem`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40349:
---

 Summary: Implement `RollingGroupby.sem`.
 Key: SPARK-40349
 URL: https://issues.apache.org/jira/browse/SPARK-40349
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `RollingGroupby.sem` for increasing pandas API coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40348) Implement `RollingGroupby.quantile`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40348:
---

 Summary: Implement `RollingGroupby.quantile`.
 Key: SPARK-40348
 URL: https://issues.apache.org/jira/browse/SPARK-40348
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `RollingGroupby.quantile` for increasing pandas API 
coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40347) Implement `RollingGroupby.median`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40347:
---

 Summary: Implement `RollingGroupby.median`.
 Key: SPARK-40347
 URL: https://issues.apache.org/jira/browse/SPARK-40347
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `RollingGroupby.median` for increasing pandas API coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40346) Implement `ExpandingGroupby.sem`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40346:
---

 Summary: Implement `ExpandingGroupby.sem`.
 Key: SPARK-40346
 URL: https://issues.apache.org/jira/browse/SPARK-40346
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `ExpandingGroupby.sem` for increasing pandas API coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40345) Implement `ExpandingGroupby.quantile`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40345:
---

 Summary: Implement `ExpandingGroupby.quantile`.
 Key: SPARK-40345
 URL: https://issues.apache.org/jira/browse/SPARK-40345
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `ExpandingGroupby.quantile` for increasing pandas API 
coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40344) Implement `ExpandingGroupby.median`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40344:
---

 Summary: Implement `ExpandingGroupby.median`.
 Key: SPARK-40344
 URL: https://issues.apache.org/jira/browse/SPARK-40344
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `ExpandingGroupby.median` for increasing pandas API 
coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40331) Java 11 should be used as the recommended running environment

2022-09-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40331:
-
Description: 
Similar cases described in SPARK-40303  will not have negative effects if Java 
11+ is used as runtime

 

 

  was:
Similar cases described in SPARK-40303  will not have negative effects if Java 
11+ is used as runtime

 


> Java 11 should be used as the recommended running environment
> -
>
> Key: SPARK-40331
> URL: https://issues.apache.org/jira/browse/SPARK-40331
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> Similar cases described in SPARK-40303  will not have negative effects if 
> Java 11+ is used as runtime
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40343) Implement `Rolling.sem`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40343:
---

 Summary: Implement `Rolling.sem`.
 Key: SPARK-40343
 URL: https://issues.apache.org/jira/browse/SPARK-40343
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `Rolling.sem` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.sem.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40342:
---

 Summary: Implement `Rolling.quantile`.
 Key: SPARK-40342
 URL: https://issues.apache.org/jira/browse/SPARK-40342
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `Rolling.quantile` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40340) Implement `Expanding.sem`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40340:
---

 Summary: Implement `Expanding.sem`.
 Key: SPARK-40340
 URL: https://issues.apache.org/jira/browse/SPARK-40340
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `Expanding.sem` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.sem.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40341) Implement `Rolling.median`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40341:
---

 Summary: Implement `Rolling.median`.
 Key: SPARK-40341
 URL: https://issues.apache.org/jira/browse/SPARK-40341
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `Rolling.median` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40338) Implement `Expanding.median`.

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40338:

Description: 
We should implement `Expanding.median` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.median.html

> Implement `Expanding.median`.
> -
>
> Key: SPARK-40338
> URL: https://issues.apache.org/jira/browse/SPARK-40338
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `Expanding.median` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.median.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40339) Implement `Expanding.quantile`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40339:
---

 Summary: Implement `Expanding.quantile`.
 Key: SPARK-40339
 URL: https://issues.apache.org/jira/browse/SPARK-40339
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `Expanding.quantile` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40337) Implement `SeriesGroupBy.describe`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40337:
---

 Summary: Implement `SeriesGroupBy.describe`.
 Key: SPARK-40337
 URL: https://issues.apache.org/jira/browse/SPARK-40337
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `SeriesGroupBy.describe` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.describe.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40338) Implement `Expanding.median`.

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40338:

Summary: Implement `Expanding.median`.  (was: Imple)

> Implement `Expanding.median`.
> -
>
> Key: SPARK-40338
> URL: https://issues.apache.org/jira/browse/SPARK-40338
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40338) Imple

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40338:
---

 Summary: Imple
 Key: SPARK-40338
 URL: https://issues.apache.org/jira/browse/SPARK-40338
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40336) Implement `DataFrameGroupBy.cov`.

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40336:

Summary: Implement `DataFrameGroupBy.cov`.  (was: Implement 
`DataFrame.cov`.)

> Implement `DataFrameGroupBy.cov`.
> -
>
> Key: SPARK-40336
> URL: https://issues.apache.org/jira/browse/SPARK-40336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `DataFrameGroupBy.cov` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cov.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40336) Implement `DataFrame.cov`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40336:
---

 Summary: Implement `DataFrame.cov`.
 Key: SPARK-40336
 URL: https://issues.apache.org/jira/browse/SPARK-40336
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `DataFrameGroupBy.cov` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cov.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40292) arrays_zip output unexpected alias column names

2022-09-04 Thread Ivan Sadikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600198#comment-17600198
 ] 

Ivan Sadikov commented on SPARK-40292:
--

I will take a look.

> arrays_zip output unexpected alias column names
> ---
>
> Key: SPARK-40292
> URL: https://issues.apache.org/jira/browse/SPARK-40292
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Linhong Liu
>Priority: Major
>
> For the below query:
> {code:sql}
> with q as (
>   select
>     named_struct(
>       'my_array', array(named_struct('x', 1, 'y', 2))
>     ) as my_struct
> )
> select
>   arrays_zip(my_struct.my_array)
> from
>   q {code}
> The latest Spark gives the schema below; the field name "my_array" was 
> changed to "0":
> {code:java}
> root
>  |-- arrays_zip(my_struct.my_array): array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- 0: struct (nullable = true)
>  |    |    |    |-- x: integer (nullable = true)
>  |    |    |    |-- y: integer (nullable = true){code}
> While Spark 3.1 gives the expected result
> {code:java}
> root
>  |-- arrays_zip(my_struct.my_array): array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- my_array: struct (nullable = true)
>  |    |    |    |-- x: integer (nullable = true)
>  |    |    |    |-- y: integer (nullable = true)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40335) Implement `DataFrameGroupBy.corr`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40335:
---

 Summary: Implement `DataFrameGroupBy.corr`.
 Key: SPARK-40335
 URL: https://issues.apache.org/jira/browse/SPARK-40335
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `DataFrameGroupBy.corr` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.corr.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40286) Load Data from S3 deletes data source file

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40286.
--
Target Version/s:   (was: 3.2.1)
  Resolution: Invalid

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using Spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a Hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file gets wiped from the directory. The file is found and does 
> populate the table with data. I also tried adding the `LOCAL` clause, but that 
> throws an error when looking for the file. The documentation doesn't explicitly 
> state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40334) Implement `GroupBy.prod`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40334:
---

 Summary: Implement `GroupBy.prod`.
 Key: SPARK-40334
 URL: https://issues.apache.org/jira/browse/SPARK-40334
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `GroupBy.prod` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40333) Implement `GroupBy.nth`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40333:
---

 Summary: Implement `GroupBy.nth`.
 Key: SPARK-40333
 URL: https://issues.apache.org/jira/browse/SPARK-40333
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `DataFrame.compare` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40333) Implement `GroupBy.nth`.

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40333:

Description: 
We should implement `GroupBy.nth` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html

  was:
We should implement `DataFrame.compare` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html


> Implement `GroupBy.nth`.
> 
>
> Key: SPARK-40333
> URL: https://issues.apache.org/jira/browse/SPARK-40333
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.nth` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-40287:
--

> Load Data using Spark by a single partition moves entire dataset under same 
> location in S3
> --
>
> Key: SPARK-40287
> URL: https://issues.apache.org/jira/browse/SPARK-40287
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello,
> I'm experiencing an issue in PySpark when creating a Hive table and loading 
> data into it. I'm using an Amazon S3 bucket as the data location, creating the 
> table as parquet, and trying to load data into that table by a single 
> partition, and I'm seeing some weird behavior. When I select the data location 
> in S3 of a parquet file to load into my table, all of the data is moved into 
> the location specified in my create table command, including the partitions I 
> didn't specify in the load data command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], 
> ["c1", "c2", "p"])
> # save it to S3
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> In the current state S3 should have a new folder `data` with two folders 
> which contain a parquet file in each partition. 
>   
>  - s3://bucket/data/p=x/
>     - part-1.snappy.parquet
>  - s3://bucket/data/p=y/
>     - part-2.snappy.parquet
>     - part-3.snappy.parquet
>  
> {code:java}
> # create new table
> spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) 
> STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved table data from s3 specifying single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION 
> (p='x')")
> spark.sql("select * from src").show()
> # output: 
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the `load data` command and looking at the table, I'm left with 
> no data loaded in. When checking S3, the source data we saved earlier has been 
> moved under `s3://bucket/new/`; oddly enough it also brought over the other 
> partitions along with their directory structure, listed below. 
> - s3://bucket/new/
>     - p=x/
>         - p=x/
>             - part-1.snappy.parquet
>         - p=y/
>             - part-2.snappy.parquet
>             - part-3.snappy.parquet
> Is this the intended behavior of loading the data in from a partitioned 
> parquet file? Is the previous file supposed to be moved/deleted from source 
> directory? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40287.
--
Resolution: Invalid

> Load Data using Spark by a single partition moves entire dataset under same 
> location in S3
> --
>
> Key: SPARK-40287
> URL: https://issues.apache.org/jira/browse/SPARK-40287
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello,
> I'm experiencing an issue in PySpark when creating a Hive table and loading 
> data into it. I'm using an Amazon S3 bucket as the data location, creating the 
> table as parquet, and trying to load data into that table by a single 
> partition, and I'm seeing some weird behavior. When I select the data location 
> in S3 of a parquet file to load into my table, all of the data is moved into 
> the location specified in my create table command, including the partitions I 
> didn't specify in the load data command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], 
> ["c1", "c2", "p"])
> # save it to S3
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> In the current state S3 should have a new folder `data` with two folders 
> which contain a parquet file in each partition. 
>   
>  - s3://bucket/data/p=x/
>     - part-1.snappy.parquet
>  - s3://bucket/data/p=y/
>     - part-2.snappy.parquet
>     - part-3.snappy.parquet
>  
> {code:java}
> # create new table
> spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) 
> STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved table data from s3 specifying single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION 
> (p='x')")
> spark.sql("select * from src").show()
> # output: 
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the `load data` command and looking at the table, I'm left with 
> no data loaded in. When checking S3, the source data we saved earlier has been 
> moved under `s3://bucket/new/`; oddly enough it also brought over the other 
> partitions along with their directory structure, listed below. 
> - s3://bucket/new/
>     - p=x/
>         - p=x/
>             - part-1.snappy.parquet
>         - p=y/
>             - part-2.snappy.parquet
>             - part-3.snappy.parquet
> Is this the intended behavior of loading the data in from a partitioned 
> parquet file? Is the previous file supposed to be moved/deleted from source 
> directory? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40287.
--
Resolution: Not A Problem

> Load Data using Spark by a single partition moves entire dataset under same 
> location in S3
> --
>
> Key: SPARK-40287
> URL: https://issues.apache.org/jira/browse/SPARK-40287
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello,
> I'm experiencing an issue in PySpark when creating a Hive table and loading 
> data into it. I'm using an Amazon S3 bucket as the data location, creating the 
> table as parquet, and trying to load data into that table by a single 
> partition, and I'm seeing some weird behavior. When I select the data location 
> in S3 of a parquet file to load into my table, all of the data is moved into 
> the location specified in my create table command, including the partitions I 
> didn't specify in the load data command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], 
> ["c1", "c2", "p"])
> # save it to S3
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> In the current state S3 should have a new folder `data` with two folders 
> which contain a parquet file in each partition. 
>   
>  - s3://bucket/data/p=x/
>     - part-1.snappy.parquet
>  - s3://bucket/data/p=y/
>     - part-2.snappy.parquet
>     - part-3.snappy.parquet
>  
> {code:java}
> # create new table
> spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) 
> STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved table data from s3 specifying single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION 
> (p='x')")
> spark.sql("select * from src").show()
> # output: 
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the `load data` command and looking at the table, I'm left with 
> no data loaded in. When checking S3, the source data we saved earlier has been 
> moved under `s3://bucket/new/`; oddly enough it also brought over the other 
> partitions along with their directory structure, listed below. 
> - s3://bucket/new/
>     - p=x/
>         - p=x/
>             - part-1.snappy.parquet
>         - p=y/
>             - part-2.snappy.parquet
>             - part-3.snappy.parquet
> Is this the intended behavior of loading the data in from a partitioned 
> parquet file? Is the previous file supposed to be moved/deleted from source 
> directory? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40289) The result is strange when casting string to date in ORC reading via Schema Evolution

2022-09-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600197#comment-17600197
 ] 

Hyukjin Kwon commented on SPARK-40289:
--

Hm, why don't you read it as a string and cast explicitly? I believe this 
behaviour is inherited from the ORC library itself.
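
For example, something along these lines (just a sketch; the exact null-vs-error 
behaviour still depends on the Spark version and ANSI/parser settings):
{code:java}
import org.apache.spark.sql.functions.{col, to_date}

// Read the column with its original string type, then cast explicitly, so the
// invalid dates go through Spark's own parser instead of the ORC reader's
// schema evolution.
val df = spark.read
  .schema("date_str string")
  .orc("/tmp/orc/data.orc")
  .withColumn("date_str", to_date(col("date_str"), "yyyy-MM-dd"))

df.show(){code}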

> The result is strange when casting string to date in ORC reading via Schema 
> Evolution
> -
>
> Key: SPARK-40289
> URL: https://issues.apache.org/jira/browse/SPARK-40289
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.1.1
> Environment: * Ubuntu 1804 LTS
>  * Spark 311
>Reporter: Jianbang Xian
>Priority: Minor
>
> I created an ORC file by the code as follows.
> {code:java}
> val data = Seq(
>     ("", "2022-01-32"),  // pay attention to this, null
>     ("", "9808-02-30"),  // pay attention to this, 9808-02-29
>     ("", "2022-06-31"),  // pay attention to this, 2022-06-30
> )
> val cols = Seq("str", "date_str")
> val df=spark.createDataFrame(data).toDF(cols:_*).repartition(1)
> df.printSchema()
> df.show(100)
> df.write.mode("overwrite").orc("/tmp/orc/data.orc")
> {code}
> Please note that these three cases are invalid dates.
> And I read it via:
> {code:java}
> scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); 
> df.show()
> +----------+
> |  date_str|
> +----------+
> |      null|
> |9808-02-29|
> |2022-06-30|
> +----------+{code}
> Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to 
> `9808-02-29`?
> Intuitively, they are all invalid dates, so we should return 3 nulls. Is this a 
> bug or a feature?
>  
>  
> *Background*
>  * I am working on the project: [https://github.com/NVIDIA/spark-rapids]
>  * I am working on a feature to support reading ORC files as a cuDF (CUDA 
> DataFrame). cuDF is an in-memory data format for GPUs.
>  * I need to follow the behavior of ORC reading on the CPU; otherwise, users of 
> spark-rapids will find the results strange.
>  * Therefore I want to know why this happens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40332:

Description: 
We should implement `GroupBy.quantile` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html

  was:
We should implement `DataFrame.compare` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html


> Implement `GroupBy.quantile`.
> -
>
> Key: SPARK-40332
> URL: https://issues.apache.org/jira/browse/SPARK-40332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40332) Implement `GroupBy.quantile`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40332:
---

 Summary: Implement `GroupBy.quantile`.
 Key: SPARK-40332
 URL: https://issues.apache.org/jira/browse/SPARK-40332
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `DataFrame.compare` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40298) shuffle data recovery on the reused PVCs no effect

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40298:
-
Priority: Major  (was: Blocker)

> shuffle data recovery on the reused PVCs  no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
>
> I used Spark 3.2.2 to test the [ Support shuffle data recovery on the reused 
> PVCs (SPARK-35593) ] feature. I found that when a shuffle read fails, the data 
> is still read from the source.
> It can be confirmed that the PVC has been reused by other pods, and the index 
> and data file information has been sent.
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G 
> --conf spark.executor.memory=15G 
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf 
> spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40298) shuffle data recovery on the reused PVCs no effect

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40298.
--
Resolution: Invalid

> shuffle data recovery on the reused PVCs  no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
>
> I used Spark 3.2.2 to test the [ Support shuffle data recovery on the reused 
> PVCs (SPARK-35593) ] feature. I found that when a shuffle read fails, the data 
> is still read from the source.
> It can be confirmed that the PVC has been reused by other pods, and the index 
> and data file information has been sent.
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G 
> --conf spark.executor.memory=15G 
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf 
> spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40298) shuffle data recovery on the reused PVCs no effect

2022-09-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600196#comment-17600196
 ] 

Hyukjin Kwon commented on SPARK-40298:
--

[~todd5167] For questions, it's better to ask on the dev mailing list; you'd 
get a better answer there.

> shuffle data recovery on the reused PVCs  no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
>
> I used Spark 3.2.2 to test the [ Support shuffle data recovery on the reused 
> PVCs (SPARK-35593) ] feature. I found that when a shuffle read fails, the data 
> is still read from the source.
> It can be confirmed that the PVC has been reused by other pods, and the index 
> and data file information has been sent.
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G 
> --conf spark.executor.memory=15G 
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf 
> spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40299) java api calls the count() method to appear: java.lang.ArithmeticException: BigInteger would overflow supported range

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40299.
--
Resolution: Cannot Reproduce

> java api calls the count() method to appear: java.lang.ArithmeticException: 
> BigInteger would overflow supported range
> -
>
> Key: SPARK-40299
> URL: https://issues.apache.org/jira/browse/SPARK-40299
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.2
>Reporter: code1v5
>Priority: Major
>
> ive Session ID = a372ea31-ac98-4e01-9de3-dfb623df87a4
> 22/09/01 13:50:32 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> [Stage 0:>                                                          (0 + 8) / 
> 8]22/09/01 13:50:41 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, 
> hdp3-10-106, executor 6): java.lang.ArithmeticException: BigInteger would 
> overflow supported range
>     at java.math.BigInteger.reportOverflow(BigInteger.java:1084)
>     at java.math.BigInteger.pow(BigInteger.java:2391)
>     at java.math.BigDecimal.bigTenToThe(BigDecimal.java:3574)
>     at java.math.BigDecimal.bigMultiplyPowerTen(BigDecimal.java:3707)
>     at java.math.BigDecimal.setScale(BigDecimal.java:2448)
>     at java.math.BigDecimal.setScale(BigDecimal.java:2515)
>     at 
> org.apache.hadoop.hive.common.type.HiveDecimal.trim(HiveDecimal.java:241)
>     at 
> org.apache.hadoop.hive.common.type.HiveDecimal.normalize(HiveDecimal.java:252)
>     at 
> org.apache.hadoop.hive.common.type.HiveDecimal.create(HiveDecimal.java:83)
>     at 
> org.apache.hadoop.hive.serde2.lazy.LazyHiveDecimal.init(LazyHiveDecimal.java:79)
>     at 
> org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:226)
>     at 
> org.apache.hadoop.hive.serde2.lazy.LazyStruct.getField(LazyStruct.java:202)
>     at 
> org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:128)
>     at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:439)
>     at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:434)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>     at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 22/09/01 13:50:42 ERROR TaskSetManager: Task 5 in stage 0.0 failed 4 times; 
> aborting job
> 22/09/01 13:50:42 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, 
> hdp2-10-105, executor 8): TaskKilled (Stage cancelled)
> [Stage 0:>                                                          (0 + 6) / 
> 8]org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 
> in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 
> (TID 10, hdp3-10-106, executor 6): java.lang.ArithmeticException: BigInteger 
> would overflow supported range
>     at java.math.BigInteger.reportOverflow(BigInteger.java:1084)
>     at java.math.BigInteger.pow(BigInteger.java:2391)
>     at java.math.BigDecimal.bigTenToThe(BigDecimal.java:3574)
>     at java.math.BigDecimal.bigMultiplyPowerTen(BigDecimal.java:3707)
>     at java.math.BigDecimal.setScale(BigDecimal.java:2448)
>     at java.math.BigDecimal.setScale(BigDecimal.java:2515)
>     at 
> org.apache.hadoop.hive.common.type.HiveDecimal.trim(HiveDecimal.java:241)
>     at 
> 

[jira] [Updated] (SPARK-40317) Improvement to JDBC predicate for queries involving joins

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40317:
-
Component/s: SQL
 (was: Spark Core)

> Improvement to JDBC predicate for queries involving joins
> -
>
> Key: SPARK-40317
> URL: https://issues.apache.org/jira/browse/SPARK-40317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: David Ahern
>Priority: Major
>
> Current behaviour on queries involving joins seems to use a subquery as follows
>  
> select * from
> (
> select a, b, c from tbl1
> lj tbl2 on tbl1.col1 = tbl2.col1
> lj tbl3 on tbl1.col2 = tbl3.col2
> )
> where predicate = 1
> where predicate = 2
> where predicate = 3
>  
> More desirable would be
> (
> select a, b, c from tbl1 where (predicate = 1, predicate = 2, etc)
> lj tbl2 on tbl1.col1 = tbl2.col1
> lj tbl3 on tbl1.col2 = tbl3.col2
> )
>  
> so that the join runs on the filtered subset of data rather than joining all data and then 
> filtering.  Predicate pushdown usually only works on columns that have been 
> indexed, but even if the data isn't indexed, this would reduce the amount of data 
> that needs to be moved.  In many cases it is better to do the join on the DB side than 
> to pull everything into Spark.
>  
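> For reference, one workaround that exists today is to push the whole join and its filters to 
> the database yourself through the JDBC source's `query` option. A minimal sketch (the JDBC URL, 
> table, column and predicate names below are placeholders, not taken from this report):
> {code:python}
> # Hypothetical query: join and filter on the database side, ship only the result to Spark.
> pushed_query = """
>   select t1.a, t1.b, t1.c
>   from tbl1 t1
>   left join tbl2 t2 on t1.col1 = t2.col1
>   left join tbl3 t3 on t1.col2 = t3.col2
>   where t1.p1 = 1 and t1.p2 = 2
> """
>
> df = (spark.read.format("jdbc")
>       .option("url", "jdbc:postgresql://dbhost:5432/db")   # placeholder URL
>       .option("query", pushed_query)
>       .option("user", "user")
>       .option("password", "password")
>       .load())
> {code}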



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40331) Java 11 should be used as the recommended running environment

2022-09-04 Thread Yang Jie (Jira)
Yang Jie created SPARK-40331:


 Summary: Java 11 should be used as the recommended running 
environment
 Key: SPARK-40331
 URL: https://issues.apache.org/jira/browse/SPARK-40331
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Yang Jie


Cases similar to the one described in SPARK-40303 will not have negative effects if Java 
11+ is used as the runtime.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40271) Support list type for pyspark.sql.functions.lit

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600193#comment-17600193
 ] 

Apache Spark commented on SPARK-40271:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37798

> Support list type for pyspark.sql.functions.lit
> ---
>
> Key: SPARK-40271
> URL: https://issues.apache.org/jira/browse/SPARK-40271
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `pyspark.sql.functions.lit` doesn't support the Python list type, 
> as below:
> {code:python}
> >>> df = spark.range(3).withColumn("c", lit([1,2,3]))
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
> The feature is not supported: Literal for '[1, 2, 3]' of class 
> java.util.ArrayList.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
>   at org.apache.spark.sql.functions$.lit(functions.scala:125)
>   at org.apache.spark.sql.functions.lit(functions.scala)
>   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> We should make it supported.
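> Until then, a common workaround (a sketch only, not part of this ticket) is to build the 
> array column element by element:
> {code:python}
> from pyspark.sql.functions import array, lit
>
> # Wrap each Python element in lit() and combine them with array().
> df = spark.range(3).withColumn("c", array(*[lit(x) for x in [1, 2, 3]]))
> df.show()
> {code}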



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40330) Implement `Series.searchsorted`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40330:
---

 Summary: Implement `Series.searchsorted`.
 Key: SPARK-40330
 URL: https://issues.apache.org/jira/browse/SPARK-40330
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `Series.searchsorted` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.Series.searchsorted.html
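
For reference, a minimal sketch of the plain pandas behaviour that the new API would
presumably mirror (not the proposed implementation):

{code:python}
import pandas as pd

ser = pd.Series([1, 2, 3])
# Index at which 2 would be inserted to keep the series sorted -> 1
print(ser.searchsorted(2))
# Several values at once -> array([0, 3])
print(ser.searchsorted([0, 4]))
{code}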



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40327:

Component/s: Pandas API on Spark
 (was: ps)

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40328) Implement `DataFrame.compare`.

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40328:

Component/s: Pandas API on Spark
 (was: ps)

> Implement `DataFrame.compare`.
> --
>
> Key: SPARK-40328
> URL: https://issues.apache.org/jira/browse/SPARK-40328
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `DataFrame.compare` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html.
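> For reference, a minimal sketch of the plain pandas behaviour that pandas-on-Spark would 
> presumably mirror (not the proposed implementation):
> {code:python}
> import pandas as pd
>
> df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
> df2 = pd.DataFrame({"a": [1, 9], "b": [3.0, 4.0]})
> # Returns only the cells that differ, split into 'self' and 'other' sub-columns.
> print(df1.compare(df2))
> {code}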



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40329) Implement `Series.compare`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40329:
---

 Summary: Implement `Series.compare`.
 Key: SPARK-40329
 URL: https://issues.apache.org/jira/browse/SPARK-40329
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement `Series.compare` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.Series.compare.html#pandas.Series.compare



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40328) Implement `DataFrame.compare`.

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40328:

Description: 
We should implement `DataFrame.compare` for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html.

  was:
We should implement DataFrame.compare for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html.


> Implement `DataFrame.compare`.
> --
>
> Key: SPARK-40328
> URL: https://issues.apache.org/jira/browse/SPARK-40328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `DataFrame.compare` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40328) Implement `DataFrame.compare`.

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40328:
---

 Summary: Implement `DataFrame.compare`.
 Key: SPARK-40328
 URL: https://issues.apache.org/jira/browse/SPARK-40328
 Project: Spark
  Issue Type: Sub-task
  Components: ps
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should implement DataFrame.compare for increasing pandas API coverage.

pandas docs: 
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-04 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40327:

Description: Increasing the pandas API coverage for Apache Spark 3.4.0.  
(was: Increasing the pandas API coverage for Apache Spark 3.4.0, as we did for 
Apache Spark 3.3.0 in https://issues.apache.org/jira/browse/SPARK-36394.)

> Increase pandas API coverage for pandas API on Spark
> 
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
>  Issue Type: Umbrella
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40327) Increase pandas API coverage for pandas API on Spark

2022-09-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40327:
---

 Summary: Increase pandas API coverage for pandas API on Spark
 Key: SPARK-40327
 URL: https://issues.apache.org/jira/browse/SPARK-40327
 Project: Spark
  Issue Type: Umbrella
  Components: ps
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Increasing the pandas API coverage for Apache Spark 3.4.0, as we did for Apache 
Spark 3.3.0 in https://issues.apache.org/jira/browse/SPARK-36394.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40149:
-
Priority: Blocker  (was: Major)

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result will include the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either include it on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  
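> Until the behaviour is settled, one version-independent workaround (a sketch only, reusing the 
> column names from the example above) is to build the structs from explicit columns instead of 
> relying on star expansion:
> {code:python}
> df_merged = (
>     df_left
>     .alias('left')
>     .join(df_right.alias('right'), on='id', how='full_outer')
>     # List the struct fields explicitly so their contents don't depend on
>     # how '*' happens to be expanded in a given Spark version.
>     .withColumn('left_all', f.struct(f.col('left.val').alias('val')))
>     .withColumn('right_all', f.struct(f.col('right.val').alias('val')))
> )
> df_merged.show()
> {code}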



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40149:
-
Target Version/s: 3.4.0

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result will include the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either include it on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40149:
-
Target Version/s:   (was: 3.4.0)

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result will include the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either include it on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40149:
-
Priority: Major  (was: Blocker)

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Major
>
> When star expansion is used on the left side of a join, the result will include the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either include it on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---++-++-+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---++-++-+
> | id| val|  val|left_all|right_all|
> +---++-++-+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---++-++-+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600111#comment-17600111
 ] 

Apache Spark commented on SPARK-40142:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/37797

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600110#comment-17600110
 ] 

Apache Spark commented on SPARK-40142:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/37797

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4

2022-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40326:


Assignee: (was: Apache Spark)

> upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 
> to 2.13.4
> --
>
> Key: SPARK-40326
> URL: https://issues.apache.org/jira/browse/SPARK-40326
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857]
> [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4

2022-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40326:


Assignee: Apache Spark

> upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 
> to 2.13.4
> --
>
> Key: SPARK-40326
> URL: https://issues.apache.org/jira/browse/SPARK-40326
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857]
> [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600084#comment-17600084
 ] 

Apache Spark commented on SPARK-40326:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37796

> upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 
> to 2.13.4
> --
>
> Key: SPARK-40326
> URL: https://issues.apache.org/jira/browse/SPARK-40326
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857]
> [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4

2022-09-04 Thread Jira
Bjørn Jørgensen created SPARK-40326:
---

 Summary: upgrade 
com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
 Key: SPARK-40326
 URL: https://issues.apache.org/jira/browse/SPARK-40326
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.4.0
Reporter: Bjørn Jørgensen


[CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857]

[SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40321) Upgrade rocksdbjni to 7.5.3

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40321:


Assignee: Yang Jie

> Upgrade rocksdbjni to 7.5.3
> ---
>
> Key: SPARK-40321
> URL: https://issues.apache.org/jira/browse/SPARK-40321
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/facebook/rocksdb/releases



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40321) Upgrade rocksdbjni to 7.5.3

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40321.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37783
[https://github.com/apache/spark/pull/37783]

> Upgrade rocksdbjni to 7.5.3
> ---
>
> Key: SPARK-40321
> URL: https://issues.apache.org/jira/browse/SPARK-40321
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/facebook/rocksdb/releases



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39996:


Assignee: Bjørn Jørgensen

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39996.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37762
[https://github.com/apache/spark/pull/37762]

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39996:
-
Component/s: Tests

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39996:
-
Priority: Minor  (was: Major)

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0
>
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40325) Support of Columnar result(ColumnarBatch) in org.apache.spark.sql.Dataset flatMap, transform, etc

2022-09-04 Thread Igor Suhorukov (Jira)
Igor Suhorukov created SPARK-40325:
--

 Summary: Support of Columnar result(ColumnarBatch) in 
org.apache.spark.sql.Dataset flatMap, transform, etc
 Key: SPARK-40325
 URL: https://issues.apache.org/jira/browse/SPARK-40325
 Project: Spark
  Issue Type: New Feature
  Components: Java API, Spark Core
Affects Versions: 3.3.0
Reporter: Igor Suhorukov


Sometimes the result of a data transformation in a JVM program is available from native 
code in the Apache Arrow columnar data format. The current Dataset API requires an 
unnecessary transformation from the columnar format wrapper into rows, with 
additional allocations on the JVM heap. 

In this proposed feature I ask for propagation of columnar data through the Dataset API 
without the unnecessary InternalRow->Row->InternalRow conversion.

 

The current solution uses a [ColumnarBatch 
wrapper|https://github.com/igor-suhorukov/spark3/blob/master/src/main/java/com/github/igorsuhorukov/arrow/spark/ArrowDataIterator.java]
 on top of ArrowColumnVector and rowExpressionEncoder.createDeserializer() to 
transform data [into 
Row|https://github.com/igor-suhorukov/spark3/blob/c655d4b6058fdd4529aa59093edfe2333d96fb05/src/main/java/com/github/igorsuhorukov/arrow/spark/ArrowDataIterator.java#L53]
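
For comparison only (this is not the JVM Dataset API change requested here): PySpark 3.3 already 
exposes Arrow-columnar processing through DataFrame.mapInArrow, which is roughly the shape of API 
being asked for. A minimal sketch, assuming an active SparkSession named spark:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

def double_ids(batches):
    # Each element is a pyarrow.RecordBatch; no per-row JVM objects are materialized.
    for batch in batches:
        yield pa.RecordBatch.from_arrays(
            [pc.multiply(batch.column("id"), 2)], names=["id"])

spark.range(5).mapInArrow(double_ids, "id long").show()
{code}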

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40324) Provide a query context of ParseException

2022-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40324:


Assignee: Apache Spark  (was: Max Gekk)

> Provide a query context of ParseException
> -
>
> Key: SPARK-40324
> URL: https://issues.apache.org/jira/browse/SPARK-40324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Extend ParseException and add a queryContext to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40324) Provide a query context of ParseException

2022-09-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40324:


Assignee: Max Gekk  (was: Apache Spark)

> Provide a query context of ParseException
> -
>
> Key: SPARK-40324
> URL: https://issues.apache.org/jira/browse/SPARK-40324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Extend ParseException and add a queryContext to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40324) Provide a query context of ParseException

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600047#comment-17600047
 ] 

Apache Spark commented on SPARK-40324:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37794

> Provide a query context of ParseException
> -
>
> Key: SPARK-40324
> URL: https://issues.apache.org/jira/browse/SPARK-40324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Extend ParseException and add a queryContext to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40324) Provide a query context of ParseException

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600048#comment-17600048
 ] 

Apache Spark commented on SPARK-40324:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37794

> Provide a query context of ParseException
> -
>
> Key: SPARK-40324
> URL: https://issues.apache.org/jira/browse/SPARK-40324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Extend ParseException and add a queryContext to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40324) Provide a query context of ParseException

2022-09-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-40324:


 Summary: Provide a query context of ParseException
 Key: SPARK-40324
 URL: https://issues.apache.org/jira/browse/SPARK-40324
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Extend ParseException and add a queryContext to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40251) Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600035#comment-17600035
 ] 

Apache Spark commented on SPARK-40251:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37791

> Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2
> --
>
> Key: SPARK-40251
> URL: https://issues.apache.org/jira/browse/SPARK-40251
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/luhenry/netlib/compare/v2.2.1...v3.0.2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600032#comment-17600032
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37790

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> {code:sql}
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> select id, name, nage as n from (
>   select id, name, if(age>3,100,200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> {code}
> --error stack
> {code:java}
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in
> [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600031#comment-17600031
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37790

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> {code:sql}
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> select id, name, nage as n from (
>   select id, name, if(age>3,100,200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> {code}
> --error stack
> {code:java}
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in
> [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600029#comment-17600029
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37788

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> {code:sql}
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> select id, name, nage as n from (
>   select id, name, if(age>3,100,200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> {code}
> --error stack
> {code:java}
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in
> [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600028#comment-17600028
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37788

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> {code:sql}
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> select id, name, nage as n from (
>   select id, name, if(age>3,100,200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> {code}
> --error stack
> {code:java}
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in
> [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups

2022-09-04 Thread Augustine Theodore Prince (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600021#comment-17600021
 ] 

Augustine Theodore Prince commented on SPARK-39796:
---

Hi [~planga82] ,
{code:java}
df.withColumn("g1", regexp_extract('a, regex, 1)).withColumn("g2", 
regexp_extract('a, regex, 2)).show
{code}
In the above statement, the regular expression is compiled and processed twice. 
Wouldn't it be more performant to compile once and extract all the groups so 
that the expression is processed only once?

> Add a regexp_extract variant which returns an array of all the matched 
> capture groups
> -
>
> Key: SPARK-39796
> URL: https://issues.apache.org/jira/browse/SPARK-39796
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Augustine Theodore Prince
>Priority: Minor
>  Labels: regexp_extract, regexp_extract_all, regexp_replace
>
>  
> regexp_extract only returns a single matched group. In a lot of cases we need 
> to parse the entire string and get all the groups and for that we'll need to 
> call it as many times as there are groups. The regexp_extract_all function 
> doesn't solve this problem as it only works if all the groups have the same 
> regex pattern.
>  
> _Example:_
> I will provide an example and the current workaround that I use to solve this,
> If I have the following dataframe and I would like to match the column 'a' 
> with this pattern
> {code:java}
> "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
> |a|
> |Hello, World, 1234|
> |Good, bye, friend|
>  
> My expected output  is as follows:
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|[]|
>  
> However, to achieve this I have to take the following approach which seems 
> very hackish.
> 1. Use regexp_replace to create a temporary string built using the extracted 
> groups:
> {code:java}
> df = df.withColumn("extracted_a" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, 
> (\\d+)", "$1_$2")){code}
> A side effect of regexp_replace is that if the regex fails to match, the 
> entire string is returned.
>  
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|Good, bye, friend|
> 2. So, to achieve the desired result, a check has to be done to prune the 
> rows that did not match the pattern:
> {code:java}
> df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , 
> None).otherwise(F.col("extracted_a"))){code}
>  
> to get the following intermediate dataframe,
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|null|
>  
> 3. Finally, split the column 'extracted_a' on underscores
> {code:java}
> df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code}
> which yields the desired result:
>  
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|null|
>  
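For reference, a sketch that chains the three workaround steps quoted above into one snippet, assuming df is the dataframe from the example; the logic and column names follow the description, and nothing here is a new Spark API:

{code:python}
from pyspark.sql import functions as F

pattern = "([A-Za-z]+), [A-Za-z]+, (\\d+)"

df = (
    df.withColumn("extracted_a", F.regexp_replace("a", pattern, "$1_$2"))
      # regexp_replace returns the input unchanged when nothing matches,
      # so rows whose result still equals 'a' are treated as non-matches.
      .withColumn("extracted_a",
                  F.when(F.col("extracted_a") == F.col("a"), None)
                   .otherwise(F.col("extracted_a")))
      # Split the placeholder string into the final array of groups.
      .withColumn("extracted_a", F.split("extracted_a", "[_]"))
)
{code}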






[jira] [Assigned] (SPARK-40301) Add parameter validation in pyspark.rdd

2022-09-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40301:
-

Assignee: Ruifeng Zheng

> Add parameter validation in pyspark.rdd
> ---
>
> Key: SPARK-40301
> URL: https://issues.apache.org/jira/browse/SPARK-40301
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>







[jira] [Resolved] (SPARK-40301) Add parameter validation in pyspark.rdd

2022-09-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40301.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37752
[https://github.com/apache/spark/pull/37752]

> Add parameter validation in pyspark.rdd
> ---
>
> Key: SPARK-40301
> URL: https://issues.apache.org/jira/browse/SPARK-40301
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40305) Implement Groupby.sem

2022-09-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40305:
-

Assignee: Ruifeng Zheng

> Implement Groupby.sem
> -
>
> Key: SPARK-40305
> URL: https://issues.apache.org/jira/browse/SPARK-40305
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-40305) Implement Groupby.sem

2022-09-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40305.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37756
[https://github.com/apache/spark/pull/37756]

> Implement Groupby.sem
> -
>
> Key: SPARK-40305
> URL: https://issues.apache.org/jira/browse/SPARK-40305
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>



