[jira] [Created] (SPARK-35441) InMemoryFileIndex loads all files into memory

2021-05-19 Thread Zhang Jianguo (Jira)
Zhang Jianguo created SPARK-35441:
-

 Summary: InMemoryFileIndex loads all files into memory
 Key: SPARK-35441
 URL: https://issues.apache.org/jira/browse/SPARK-35441
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Zhang Jianguo


When InMemoryFileIndex is initialized, it loads all files under ${rootPath}. If 
there are thousands of files, this can lead to an OOM.

 

https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66
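A minimal illustration of the behaviour, using only the public reader API (the paths 
below are hypothetical): the eager listing happens as soon as the DataFrame is 
created, so scoping the read to the partitions that are actually needed keeps the 
file index, and therefore driver memory, small. This is a workaround sketch, not a 
fix for the ticket.

{code:language=scala}
// Hypothetical layout: /data/events/dt=YYYY-MM-DD/...
val all = spark.read.parquet("/data/events")           // lists every leaf file under
                                                       // /data/events into the driver-side
                                                       // InMemoryFileIndex

val oneDay = spark.read
  .option("basePath", "/data/events")                  // keep dt as a partition column
  .parquet("/data/events/dt=2021-05-19")               // only this partition is listed
{code}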



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-05-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35106:
---

Assignee: Erik Krogen

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   // BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   // BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   // BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.
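An illustrative sketch of the reorganization argued for above (hedged: this is not 
necessarily the change that was merged). The per-file renames of Block 2 only run 
when dynamic partition overwrite is off, and their failures are surfaced; in the 
overwrite case the deletes and renames are kept together so they no longer race. 
Names such as {{fs}}, {{filesToMove}}, {{allPartitionPaths}}, {{stagingDir}} and 
{{path}} are taken from the snippet quoted above.

{code:language=scala}
if (dynamicPartitionOverwrite) {
  // Blocks 1 and 3 consolidated: clear the destination partition directories first...
  val absPartitionPaths = filesToMove.values.map(new Path(_).getParent).toSet
  absPartitionPaths.foreach(fs.delete(_, true))
  // ...then recreate the parents and move the staged absolute-path files back in...
  for ((src, dst) <- filesToMove) {
    val dstPath = new Path(dst)
    fs.mkdirs(dstPath.getParent)
    fs.rename(new Path(src), dstPath)
  }
  // ...and finally move the staged partition directories into place (Block 3 as before).
  val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
  for (part <- partitionPaths) {
    val finalPartPath = new Path(path, part)
    if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
      fs.mkdirs(finalPartPath.getParent)
    }
    fs.rename(new Path(stagingDir, part), finalPartPath)
  }
} else {
  // Block 2 only runs when dynamic partition overwrite is not used, and a failed
  // rename is no longer silently ignored.
  for ((src, dst) <- filesToMove) {
    if (!fs.rename(new Path(src), new Path(dst))) {
      throw new java.io.IOException(s"Failed to rename $src to $dst")
    }
  }
}
{code}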



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-05-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35106.
-
Fix Version/s: 3.0.3
   3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 32530
[https://github.com/apache/spark/pull/32530]

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   // BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   // BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   // BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35368) [SQL] Update histogram statistics for RANGE operator stats estimation

2021-05-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35368.
--
Fix Version/s: 3.2.0
 Assignee: shahid
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32498

> [SQL] Update histogram statistics for RANGE operator stats estimation
> 
>
> Key: SPARK-35368
> URL: https://issues.apache.org/jira/browse/SPARK-35368
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35441) InMemoryFileIndex loads all files into memory

2021-05-19 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347404#comment-17347404
 ] 

Yuming Wang commented on SPARK-35441:
-

What is your driver memory?

> InMemoryFileIndex loads all files into memory
> 
>
> Key: SPARK-35441
> URL: https://issues.apache.org/jira/browse/SPARK-35441
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zhang Jianguo
>Priority: Minor
>
> When InMemoryFileIndex is initialized, it loads all files under ${rootPath}. 
> If there are thousands of files, this can lead to an OOM.
>  
> https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35442) Eliminate unnecessary join through Aggregate

2021-05-19 Thread XiDuo You (Jira)
XiDuo You created SPARK-35442:
-

 Summary: Eliminate unnecessary join through Aggregate
 Key: SPARK-35442
 URL: https://issues.apache.org/jira/browse/SPARK-35442
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: XiDuo You


If Aggregate and Join have the same output partitioning, the plan will look like

 
{code:java}
 SortMergeJoin
   Sort
 HashAggregate
   Shuffle
   Sort
 xxx{code}
 

To improve `EliminateUnnecessaryJoin`, we can support this case.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35442) Eliminate unnecessary join through Aggregate

2021-05-19 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-35442:
--
Description: 
If Aggregate and Join have the same output partitioning, the plan will look 
like:
{code:java}
 SortMergeJoin
   Sort
 HashAggregate
   Shuffle
   Sort
 xxx{code}
 

To improve `EliminateUnnecessaryJoin`, we can support this case.

 

  was:
If Aggregate and Join have the same output partitioning, the plan will look like

 
{code:java}
 SortMergeJoin
   Sort
 HashAggregate
   Shuffle
   Sort
 xxx{code}
 

To improve `EliminateUnnecessaryJoin`, we can support this case.

 


> Eliminate unnecessary join through Aggregate
> 
>
> Key: SPARK-35442
> URL: https://issues.apache.org/jira/browse/SPARK-35442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Minor
>
> If Aggregate and Join have the same output partitioning, the plan will look 
> like:
> {code:java}
>  SortMergeJoin
>Sort
>  HashAggregate
>Shuffle
>Sort
>  xxx{code}
>  
> To improve `EliminateUnnecessaryJoin`, we can support this case.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35442) Eliminate unnecessary join through Aggregate

2021-05-19 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-35442:
--
Description: 
If Aggregate and Join have the same output partitioning, the plan will look 
like:
{code:java}
 SortMergeJoin
   Sort
 HashAggregate
   Shuffle
   Sort
 xxx{code}
 

Currently `EliminateUnnecessaryJoin` doesn't support optimizing this case. 
Logically, if the Aggregate grouping expression is not empty, we can safely 
eliminate it.

 

  was:
If Aggregate and Join have the same output partitioning, the plan will look 
like:
{code:java}
 SortMergeJoin
   Sort
 HashAggregate
   Shuffle
   Sort
 xxx{code}
 

To improve `EliminateUnnecessaryJoin`, we can support this case.

 


> Eliminate unnecessary join through Aggregate
> 
>
> Key: SPARK-35442
> URL: https://issues.apache.org/jira/browse/SPARK-35442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Minor
>
> If Aggregate and Join have the same output partitioning, the plan will look 
> like:
> {code:java}
>  SortMergeJoin
>Sort
>  HashAggregate
>Shuffle
>Sort
>  xxx{code}
>  
> Currently `EliminateUnnecessaryJoin` doesn't support optimizing this case. 
> Logically, if the Aggregate grouping expression is not empty, we can safely 
> eliminate it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35442) Eliminate unnecessary join through Aggregate

2021-05-19 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-35442:
--
Description: 
If Aggregate and Join have the same output partitioning, the plan will look 
like:
{code:java}
 SortMergeJoin
   Sort
 HashAggregate
   Shuffle
   Sort
 xxx{code}
 

Currently `EliminateUnnecessaryJoin` doesn't support optimizing this case. 
Logically, if the Aggregate grouping expression is not empty, we can eliminate 
it safely.

 

  was:
If Aggregate and Join have the same output partitioning, the plan will look 
like:
{code:java}
 SortMergeJoin
   Sort
 HashAggregate
   Shuffle
   Sort
 xxx{code}
 

Currently `EliminateUnnecessaryJoin` doesn't support optimizing this case. 
Logically, if the Aggregate grouping expression is not empty, we can safely 
eliminate it.

 


> Eliminate unnecessary join through Aggregate
> 
>
> Key: SPARK-35442
> URL: https://issues.apache.org/jira/browse/SPARK-35442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Minor
>
> If Aggregate and Join have the same output partitioning, the plan will look 
> like:
> {code:java}
>  SortMergeJoin
>Sort
>  HashAggregate
>Shuffle
>Sort
>  xxx{code}
>  
> Currently `EliminateUnnecessaryJoin` doesn't support optimizing this case. 
> Logically, if the Aggregate grouping expression is not empty, we can 
> eliminate it safely.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26284) Spark History server object vs file storage behavior difference

2021-05-19 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347496#comment-17347496
 ] 

Steve Loughran commented on SPARK-26284:


# s3:// URLs mean you are using EMR? If so: take it up with them. Or, if you 
are using a very old version of hadoop, move up to a newer build with s3a. It 
still won't fix this problem, but you will be in supported code.

Now, please look at the comment above yours, and reread, especially the bit 
where I explain why this is WONTFIX. thx

> Spark History server object vs file storage behavior difference
> ---
>
> Key: SPARK-26284
> URL: https://issues.apache.org/jira/browse/SPARK-26284
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> I am using the Spark History Server in order to view running/complete jobs on 
> Spark using the Kubernetes scheduling backend introduced in 2.3.0. Using a 
> local file path in both {{spark.eventLog.dir}} and 
> {{spark.history.fs.logDirectory}}, I have no issue seeing both incomplete and 
> completed tasks, with {{.inprogress}} files being flushed regularly. However, 
> when using an {{s3a://}} path, it seems the calls to flush the file 
> ([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154])
>  don't actually upload the file to S3. Due to this, I am unable to see 
> currently incomplete tasks when using an s3a path.
> From the behavior I've observed, it only uploads on completion of the task 
> (hadoop 2.7) or upon the log file filling up the block size set for s3a 
> {{spark.hadoop.fs.s3a.multipart.size}} (hadoop 3.0.0). Is this intended 
> behavior?
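For reference, a hedged sketch of the configuration being discussed (the bucket 
name is hypothetical; the keys are the ones quoted above):

{code:language=scala}
import org.apache.spark.SparkConf

// Applications write event logs here and the History Server reads them back.
// With an s3a:// destination, the in-progress log may not become visible until
// the file is closed or a multipart part is uploaded, as described above.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "s3a://my-bucket/spark-logs")
  .set("spark.history.fs.logDirectory", "s3a://my-bucket/spark-logs")
{code}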



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35418) Add sentences function to functions.{scala,py}

2021-05-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-35418.

Fix Version/s: 3.2.0
   Resolution: Fixed

Resolved in https://github.com/apache/spark/pull/32566.

> Add sentences function to functions.{scala,py}
> --
>
> Key: SPARK-35418
> URL: https://issues.apache.org/jira/browse/SPARK-35418
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> Spark SQL provides sentences function but it can be used only from SQL.
> It's good if we can use it from Scala and Python code.
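A minimal usage sketch, assuming the new Scala API mirrors the SQL function's 
signature (a string column, plus optional language and country columns) and that an 
active SparkSession named {{spark}} is available:

{code:language=scala}
import org.apache.spark.sql.functions.{col, lit, sentences}
import spark.implicits._

val df = Seq("Hi there! Good morning.").toDF("text")

// One-argument form: splits the text into an array of sentences,
// each sentence being an array of words.
df.select(sentences(col("text"))).show(false)

// Three-argument form with an explicit locale.
df.select(sentences(col("text"), lit("en"), lit("US"))).show(false)
{code}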



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Ashray Jain (Jira)
Ashray Jain created SPARK-35443:
---

 Summary: Mark K8s secrets and config maps as immutable
 Key: SPARK-35443
 URL: https://issues.apache.org/jira/browse/SPARK-35443
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.1
Reporter: Ashray Jain


Kubernetes supports marking secrets and config maps as immutable to gain 
performance. 

[https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]

[https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]

For K8s clusters that run many thousands of Spark applications, this can yield 
significant reduction in load on the kube-apiserver.

From the K8s docs:
{quote}For clusters that extensively use Secrets (at least tens of thousands of 
unique Secret to Pod mounts), preventing changes to their data has the 
following advantages:
 * protects you from accidental (or unwanted) updates that could cause 
applications outages
 * improves performance of your cluster by significantly reducing load on 
kube-apiserver, by closing watches for secrets marked as immutable.{quote}
 

For any secrets and config maps we create in Spark that are immutable, we could 
mark them as immutable by including the following when building the 
secret/config map
{code:java}
.withImmutable(true)
{code}
This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
K8s 1.21
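A hedged sketch of what that could look like with the fabric8 client used by Spark's 
Kubernetes backend, assuming a client/model version that already exposes the 
{{immutable}} field; the resource name and data below are hypothetical:

{code:language=scala}
import io.fabric8.kubernetes.api.model.ConfigMapBuilder

val execConfMap = new ConfigMapBuilder()
  .withNewMetadata()
    .withName("spark-exec-conf")                 // hypothetical name
  .endMetadata()
  .addToData("spark.properties", "spark.executor.cores=2")
  .withImmutable(true)                           // apiserver can stop watching this object
  .build()
{code}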



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35443:


Assignee: Apache Spark

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Ashray Jain
>Assignee: Apache Spark
>Priority: Minor
>  Labels: kubernetes
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347599#comment-17347599
 ] 

Apache Spark commented on SPARK-35443:
--

User 'ashrayjain' has created a pull request for this issue:
https://github.com/apache/spark/pull/32588

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Ashray Jain
>Priority: Minor
>  Labels: kubernetes
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35443:


Assignee: (was: Apache Spark)

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Ashray Jain
>Priority: Minor
>  Labels: kubernetes
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347600#comment-17347600
 ] 

Apache Spark commented on SPARK-35443:
--

User 'ashrayjain' has created a pull request for this issue:
https://github.com/apache/spark/pull/32588

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Ashray Jain
>Priority: Minor
>  Labels: kubernetes
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35444) Improve createTable logic if table exists

2021-05-19 Thread yikf (Jira)
yikf created SPARK-35444:


 Summary: Improve createTable logic if table exists
 Key: SPARK-35444
 URL: https://issues.apache.org/jira/browse/SPARK-35444
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: yikf
 Fix For: 3.0.0


It should return directly if table exists and ignoreIfExists=true



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35444) Improve createTable logic if table exists

2021-05-19 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-35444:
-
Description: 
It should return directly if table exists and ignoreIfExists=true

 

current logic:
{code:java}
requireDbExists(db)
if (tableExists(newTableDefinition.identifier)) {
  if (!ignoreIfExists) {
throw new TableAlreadyExistsException(db = db, table = table)
  }
} else if (validateLocation) {
  validateTableLocation(newTableDefinition)
}
externalCatalog.createTable(newTableDefinition, ignoreIfExists)
{code}

  was:It should return directly if table exists and ignoreIfExists=true


> Improve createTable logic if table exists
> -
>
> Key: SPARK-35444
> URL: https://issues.apache.org/jira/browse/SPARK-35444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Major
> Fix For: 3.0.0
>
>
> It should return directly if table exists and ignoreIfExists=true
>  
> current logic:
> {code:java}
> requireDbExists(db)
> if (tableExists(newTableDefinition.identifier)) {
>   if (!ignoreIfExists) {
> throw new TableAlreadyExistsException(db = db, table = table)
>   }
> } else if (validateLocation) {
>   validateTableLocation(newTableDefinition)
> }
> externalCatalog.createTable(newTableDefinition, ignoreIfExists)
> {code}
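A hedged sketch of the short-circuit being suggested (not necessarily the actual 
fix): when the table already exists and {{ignoreIfExists=true}}, return before the 
location validation and the external-catalog call.

{code:language=scala}
requireDbExists(db)
if (tableExists(newTableDefinition.identifier)) {
  if (!ignoreIfExists) {
    throw new TableAlreadyExistsException(db = db, table = table)
  }
  // The table exists and the caller asked to ignore that, so there is nothing left to do.
  return
}
if (validateLocation) {
  validateTableLocation(newTableDefinition)
}
externalCatalog.createTable(newTableDefinition, ignoreIfExists)
{code}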



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35444) Improve createTable logic if table exists

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347619#comment-17347619
 ] 

Apache Spark commented on SPARK-35444:
--

User 'yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/32589

> Improve createTable logic if table exists
> -
>
> Key: SPARK-35444
> URL: https://issues.apache.org/jira/browse/SPARK-35444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Major
> Fix For: 3.0.0
>
>
> It should return directly if table exists and ignoreIfExists=true
>  
> current logic:
> {code:java}
> requireDbExists(db)
> if (tableExists(newTableDefinition.identifier)) {
>   if (!ignoreIfExists) {
> throw new TableAlreadyExistsException(db = db, table = table)
>   }
> } else if (validateLocation) {
>   validateTableLocation(newTableDefinition)
> }
> externalCatalog.createTable(newTableDefinition, ignoreIfExists)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35444) Improve createTable logic if table exists

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35444:


Assignee: Apache Spark

> Improve createTable logic if table exists
> -
>
> Key: SPARK-35444
> URL: https://issues.apache.org/jira/browse/SPARK-35444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.0
>
>
> It should return directly if table exists and ignoreIfExists=true
>  
> current logic:
> {code:java}
> requireDbExists(db)
> if (tableExists(newTableDefinition.identifier)) {
>   if (!ignoreIfExists) {
> throw new TableAlreadyExistsException(db = db, table = table)
>   }
> } else if (validateLocation) {
>   validateTableLocation(newTableDefinition)
> }
> externalCatalog.createTable(newTableDefinition, ignoreIfExists)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35444) Improve createTable logic if table exists

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35444:


Assignee: (was: Apache Spark)

> Improve createTable logic if table exists
> -
>
> Key: SPARK-35444
> URL: https://issues.apache.org/jira/browse/SPARK-35444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Major
> Fix For: 3.0.0
>
>
> It should return directly if table exists and ignoreIfExists=true
>  
> current logic:
> {code:java}
> requireDbExists(db)
> if (tableExists(newTableDefinition.identifier)) {
>   if (!ignoreIfExists) {
> throw new TableAlreadyExistsException(db = db, table = table)
>   }
> } else if (validateLocation) {
>   validateTableLocation(newTableDefinition)
> }
> externalCatalog.createTable(newTableDefinition, ignoreIfExists)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35444) Improve createTable logic if table exists

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347620#comment-17347620
 ] 

Apache Spark commented on SPARK-35444:
--

User 'yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/32589

> Improve createTable logic if table exists
> -
>
> Key: SPARK-35444
> URL: https://issues.apache.org/jira/browse/SPARK-35444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Major
> Fix For: 3.0.0
>
>
> It should return directly if table exists and ignoreIfExists=true
>  
> current logic:
> {code:java}
> requireDbExists(db)
> if (tableExists(newTableDefinition.identifier)) {
>   if (!ignoreIfExists) {
> throw new TableAlreadyExistsException(db = db, table = table)
>   }
> } else if (validateLocation) {
>   validateTableLocation(newTableDefinition)
> }
> externalCatalog.createTable(newTableDefinition, ignoreIfExists)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35362) Update null count in the column stats for UNION stats estimation

2021-05-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35362.
--
Fix Version/s: 3.2.0
 Assignee: shahid
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32494

> Update null count in the column stats for UNION stats estimation
> 
>
> Key: SPARK-35362
> URL: https://issues.apache.org/jira/browse/SPARK-35362
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.2
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35093) AQE columnar mismatch on exchange reuse

2021-05-19 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35093.
---
Fix Version/s: 3.2.0
   3.1.2
   3.0.3
 Assignee: Andy Grove
   Resolution: Fixed

> AQE columnar mismatch on exchange reuse
> ---
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35445) Reduce the execution time of DeduplicateRelations

2021-05-19 Thread wuyi (Jira)
wuyi created SPARK-35445:


 Summary: Reduce the execution time of DeduplicateRelations
 Key: SPARK-35445
 URL: https://issues.apache.org/jira/browse/SPARK-35445
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: wuyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn

2021-05-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-35373:
-
Fix Version/s: 3.1.2
   3.0.3

> Verify checksums of downloaded artifacts in build/mvn
> -
>
> Key: SPARK-35373
> URL: https://issues.apache.org/jira/browse/SPARK-35373
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> build/mvn is a convenience script that will automatically download Maven (and 
> Scala) if not already present. While it downloads from official ASF mirrors, 
> it does not check the checksum of the artifact, which is available as a 
> .sha512 file from ASF servers.
> The risk of a supply chain attack is a bit less theoretical here than usual, 
> because artifacts are downloaded from any of several mirrors worldwide, and 
> injecting a malicious copy of Maven in any one of them might be simpler and 
> less noticeable than injecting it into ASF servers.
> (Note, Scala's download site does not seem to provide a checksum. They do all 
> come from Lightbend, at least, not N mirrors. Not much we can do there.)
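build/mvn itself is a shell script, so the following is only a hedged, 
language-agnostic sketch of the proposed check, with hypothetical file names: 
compute the SHA-512 of the downloaded archive and compare it with the published 
{{.sha512}} value before using the download.

{code:language=scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

def sha512Hex(file: String): String =
  MessageDigest.getInstance("SHA-512")
    .digest(Files.readAllBytes(Paths.get(file)))
    .map("%02x".format(_))
    .mkString

// Hypothetical names; the .sha512 file is the one published on ASF servers.
val archive  = "apache-maven-3.6.3-bin.tar.gz"
val expected = new String(Files.readAllBytes(Paths.get(archive + ".sha512")),
  StandardCharsets.UTF_8).trim.split("\\s+")(0)

require(sha512Hex(archive) == expected, s"Bad checksum for $archive, refusing to use it")
{code}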



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35445) Reduce the execution time of DeduplicateRelations

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35445:


Assignee: Apache Spark

> Reduce the execution time of DeduplicateRelations
> -
>
> Key: SPARK-35445
> URL: https://issues.apache.org/jira/browse/SPARK-35445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35445) Reduce the execution time of DeduplicateRelations

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35445:


Assignee: (was: Apache Spark)

> Reduce the execution time of DeduplicateRelations
> -
>
> Key: SPARK-35445
> URL: https://issues.apache.org/jira/browse/SPARK-35445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35445) Reduce the execution time of DeduplicateRelations

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347706#comment-17347706
 ] 

Apache Spark commented on SPARK-35445:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/32590

> Reduce the execution time of DeduplicateRelations
> -
>
> Key: SPARK-35445
> URL: https://issues.apache.org/jira/browse/SPARK-35445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35445) Reduce the execution time of DeduplicateRelations

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347704#comment-17347704
 ] 

Apache Spark commented on SPARK-35445:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/32590

> Reduce the execution time of DeduplicateRelations
> -
>
> Key: SPARK-35445
> URL: https://issues.apache.org/jira/browse/SPARK-35445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35446) Override getJDBCType in MySQLDialect to map FloatType to FLOAT

2021-05-19 Thread Marios Meimaris (Jira)
Marios Meimaris created SPARK-35446:
---

 Summary: Override getJDBCType in MySQLDialect to map FloatType to 
FLOAT
 Key: SPARK-35446
 URL: https://issues.apache.org/jira/browse/SPARK-35446
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Marios Meimaris


In 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L165],
 FloatType is mapped to REAL. However, MySQL treats REAL as a synonym for DOUBLE 
by default (see [https://dev.mysql.com/doc/refman/8.0/en/numeric-types.html] 
for more information). 

`MySQLDialect` could override `getJDBCType` so that it maps FloatType to FLOAT 
instead.
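A hedged sketch of that mapping, written as a user-registered dialect because 
{{MySQLDialect}} itself is internal to Spark; {{JdbcDialect}}, {{JdbcType}} and 
{{JdbcDialects.registerDialect}} are existing public developer APIs:

{code:language=scala}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, FloatType}

object MySQLFloatDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    // Map FloatType to MySQL FLOAT instead of the default REAL (a DOUBLE synonym).
    case FloatType => Some(JdbcType("FLOAT", Types.FLOAT))
    case _         => None // defer to the default mappings
  }
}

JdbcDialects.registerDialect(MySQLFloatDialect)
{code}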



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347765#comment-17347765
 ] 

Apache Spark commented on SPARK-35373:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32591

> Verify checksums of downloaded artifacts in build/mvn
> -
>
> Key: SPARK-35373
> URL: https://issues.apache.org/jira/browse/SPARK-35373
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> build/mvn is a convenience script that will automatically download Maven (and 
> Scala) if not already present. While it downloads from official ASF mirrors, 
> it does not check the checksum of the artifact, which is available as a 
> .sha512 file from ASF servers.
> The risk of a supply chain attack is a bit less theoretical here than usual, 
> because artifacts are downloaded from any of several mirrors worldwide, and 
> injecting a malicious copy of Maven in any one of them might be simpler and 
> less noticeable than injecting it into ASF servers.
> (Note, Scala's download site does not seem to provide a checksum. They do all 
> come from Lightbend, at least, not N mirrors. Not much we can do there.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35343) Make conversion from/to pandas data-type-based

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35343:


Assignee: Apache Spark

> Make conversion from/to pandas data-type-based
> --
>
> Key: SPARK-35343
> URL: https://issues.apache.org/jira/browse/SPARK-35343
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35343) Make conversion from/to pandas data-type-based

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347808#comment-17347808
 ] 

Apache Spark commented on SPARK-35343:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32592

> Make conversion from/to pandas data-type-based
> --
>
> Key: SPARK-35343
> URL: https://issues.apache.org/jira/browse/SPARK-35343
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35343) Make conversion from/to pandas data-type-based

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347809#comment-17347809
 ] 

Apache Spark commented on SPARK-35343:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32592

> Make conversion from/to pandas data-type-based
> --
>
> Key: SPARK-35343
> URL: https://issues.apache.org/jira/browse/SPARK-35343
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35343) Make conversion from/to pandas data-type-based

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35343:


Assignee: (was: Apache Spark)

> Make conversion from/to pandas data-type-based
> --
>
> Key: SPARK-35343
> URL: https://issues.apache.org/jira/browse/SPARK-35343
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35447) optimize skew join before coalescing shuffle partitions

2021-05-19 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35447:
---

 Summary: optimize skew join before coalescing shuffle partitions
 Key: SPARK-35447
 URL: https://issues.apache.org/jira/browse/SPARK-35447
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins

2021-05-19 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347817#comment-17347817
 ] 

Andy Grove commented on SPARK-32304:


[~dongjoon] This is the issue we just saw in branch-3.0.

 

 

> Flaky Test: AdaptiveQueryExecSuite.multiple joins
> -
>
> Key: SPARK-32304
> URL: https://issues.apache.org/jira/browse/SPARK-32304
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> AdaptiveQueryExecSuite:
> - Change merge join to broadcast join (313 milliseconds)
> - Reuse the parallelism of CoalescedShuffleReaderExec in 
> LocalShuffleReaderExec (265 milliseconds)
> - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds)
> - Empty stage coalesced to 0-partition RDD (514 milliseconds)
> - Scalar subquery (406 milliseconds)
> - Scalar subquery in later stages (500 milliseconds)
> - multiple joins *** FAILED *** (739 milliseconds)
>   ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft
>   :- BroadcastQueryStage 5
>   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, 
> false] as bigint))), [id=#504817]
>   : +- CustomShuffleReader local
>   :+- ShuffleQueryStage 4
>   :   +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777]
>   :  +- *(7) SortMergeJoin [key#251418], [a#251428], Inner
>   : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0
>   : :  +- CustomShuffleReader coalesced
>   : : +- ShuffleQueryStage 0
>   : :+- Exchange hashpartitioning(key#251418, 5), 
> true, [id=#504656]
>   : :   +- *(1) Filter (isnotnull(value#251419) AND 
> (cast(value#251419 as int) = 1))
>   : :  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) 
> AS value#251419]
>   : : +- Scan[obj#251417]
>   : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0
>   :+- CustomShuffleReader coalesced
>   :   +- ShuffleQueryStage 1
>   :  +- Exchange hashpartitioning(a#251428, 5), true, 
> [id=#504663]
>   : +- *(2) Filter (b#251429 = 1)
>   :+- *(2) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, 
> knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429]
>   :   +- Scan[obj#251427]
>   +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>  :- CustomShuffleReader local
>  :  +- ShuffleQueryStage 2
>  : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>  :+- *(3) Filter (n#251498 = 1)
>  :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) 
> AS l#251499]
>  :  +- Scan[obj#251497]
>  +- BroadcastQueryStage 6
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#504830]
>+- CustomShuffleReader local
>   +- ShuffleQueryStage 3
>  +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694]
> +- *(4) Filter (a#251438 = 1)
>+- *(4) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, 
> unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439]
>   +- Scan[obj#251437]
>   , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>   :- CustomShuffleReader local
>   :  +- ShuffleQueryStage 2
>   : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>   :+- *(3) Filter (n#251498 = 1)
>   :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$Lowe

[jira] [Assigned] (SPARK-35447) optimize skew join before coalescing shuffle partitions

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35447:


Assignee: (was: Apache Spark)

> optimize skew join before coalescing shuffle partitions
> ---
>
> Key: SPARK-35447
> URL: https://issues.apache.org/jira/browse/SPARK-35447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35447) optimize skew join before coalescing shuffle partitions

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347819#comment-17347819
 ] 

Apache Spark commented on SPARK-35447:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32594

> optimize skew join before coalescing shuffle partitions
> ---
>
> Key: SPARK-35447
> URL: https://issues.apache.org/jira/browse/SPARK-35447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35447) optimize skew join before coalescing shuffle partitions

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35447:


Assignee: Apache Spark

> optimize skew join before coalescing shuffle partitions
> ---
>
> Key: SPARK-35447
> URL: https://issues.apache.org/jira/browse/SPARK-35447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35447) optimize skew join before coalescing shuffle partitions

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347820#comment-17347820
 ] 

Apache Spark commented on SPARK-35447:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32594

> optimize skew join before coalescing shuffle partitions
> ---
>
> Key: SPARK-35447
> URL: https://issues.apache.org/jira/browse/SPARK-35447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35448) Subexpression elimination enhancements

2021-05-19 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-35448:
---

 Summary: Subexpression elimination enhancements
 Key: SPARK-35448
 URL: https://issues.apache.org/jira/browse/SPARK-35448
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.2.0
Reporter: L. C. Hsieh






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination

2021-05-19 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35439:

Parent: SPARK-35448
Issue Type: Sub-task  (was: Improvement)

> Children subexpr should come first than parent subexpr in subexpression 
> elimination
> ---
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> EquivalentExpressions maintains a map of equivalent expressions. It is currently 
> a HashMap, so the insertion order is not guaranteed to be preserved later. 
> Subexpression elimination relies on retrieving subexpressions from the map. 
> If there are child-parent relationships among the subexpressions, we want the 
> child expressions to come before the parent expressions, so that we can replace 
> the child expressions inside the parent expressions with the subexpression 
> evaluation.
> For example, we have two different expressions, add = Add(Literal(1), Literal(2)) 
> and Add(Literal(3), add).
> Case 1: the child subexpr comes first. Replacing HashMap with LinkedHashMap can 
> deal with this.
> addExprTree(add)
> addExprTree(Add(Literal(3), add))
> addExprTree(Add(Literal(3), add))
> Case 2: the parent subexpr comes first. For this case, we need to sort the 
> equivalent expressions.
> addExprTree(Add(Literal(3), add))  => `Add(Literal(3), add)` is added into the 
> map first, then `add` is added into the map
> addExprTree(add)
> addExprTree(Add(Literal(3), add))
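
A toy, self-contained Scala sketch of why the ordering matters (this is not the 
actual Catalyst EquivalentExpressions code; the Expr, Add and Literal classes 
below are illustrative stand-ins):

{code:scala}
import scala.collection.mutable

// Toy stand-ins for Catalyst expressions.
sealed trait Expr
case class Literal(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object SubexprOrdering {
  def main(args: Array[String]): Unit = {
    val add    = Add(Literal(1), Literal(2))
    val parent = Add(Literal(3), add)

    // Case 1: the child is inserted first. A LinkedHashMap preserves insertion
    // order, so iterating the keys already yields the child before the parent.
    val seen = mutable.LinkedHashMap[Expr, Int]()
    Seq(add, parent, parent).foreach(e => seen(e) = seen.getOrElse(e, 0) + 1)
    println(seen.keys.toSeq)   // child Add(Literal(1),Literal(2)) comes first

    // Case 2: the parent is inserted first. Insertion order alone is not enough;
    // the candidates still need to be sorted so children precede their parents,
    // e.g. by expression height.
    def height(e: Expr): Int = e match {
      case Add(l, r) => 1 + math.max(height(l), height(r))
      case _         => 1
    }
    val seen2 = mutable.LinkedHashMap[Expr, Int]()
    Seq(parent, add, parent).foreach(e => seen2(e) = seen2.getOrElse(e, 0) + 1)
    println(seen2.keys.toSeq.sortBy(height))   // child first only after sorting
  }
}
{code}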



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35410) Unused subexpressions leftover in WholeStageCodegen subexpression elimination

2021-05-19 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35410:

Parent: SPARK-35448
Issue Type: Sub-task  (was: Bug)

> Unused subexpressions leftover in WholeStageCodegen subexpression elimination
> -
>
> Key: SPARK-35410
> URL: https://issues.apache.org/jira/browse/SPARK-35410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
> Attachments: codegen.txt
>
>
> Trying to understand and debug the performance of some of our jobs, I started 
> digging into what the whole stage codegen code was doing. We use a lot of 
> case when statements, and I found that there were a lot of unused sub 
> expressions that were left over after the subexpression elimination, and it 
> gets worse the more expressions you have chained. The simple example:
> {code:java}
> import org.apache.spark.sql.functions._
> import spark.implicits._
> val myUdf = udf((s: String) => {
>   println("In UDF")
>   s.toUpperCase
> })
> spark.range(5).select(when(length(myUdf($"id")) > 0, 
> length(myUdf($"id")))).show()
> {code}
> Running the code, you'll see "In UDF" printed out 10 times. And if you change 
> both to log(length(myUdf($"id"))), "In UDF" will print out 20 times (one more 
> for a cast from int to double and one more for the actual log calculation, I 
> think).
> In the codegen for this (without the log), there are these initial 
> subexpressions:
> {code:java}
> /* 076 */ UTF8String project_subExprValue_0 = 
> project_subExpr_0(project_expr_0_0);
> /* 077 */ int project_subExprValue_1 = 
> project_subExpr_1(project_expr_0_0);
> /* 078 */ UTF8String project_subExprValue_2 = 
> project_subExpr_2(project_expr_0_0);
> {code}
> project_subExprValue_0 and project_subExprValue_2 are never actually used, so 
> it is correctly recognizing the two equivalent expressions and sharing the result 
> of project_subExprValue_1, but it does not seem to remove the now-unused 
> subexpression calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen

2021-05-19 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347842#comment-17347842
 ] 

L. C. Hsieh commented on SPARK-35449:
-

cc [~Kimahriman]

> Should not extract common expressions from value expressions when elseValue 
> is empty in CaseWhen
> 
>
> Key: SPARK-35449
> URL: https://issues.apache.org/jira/browse/SPARK-35449
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen

2021-05-19 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-35449:
---

 Summary: Should not extract common expressions from value 
expressions when elseValue is empty in CaseWhen
 Key: SPARK-35449
 URL: https://issues.apache.org/jira/browse/SPARK-35449
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
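
A rough, hypothetical illustration of the query shape the title describes: the 
UDF below is common to every value branch of a CaseWhen that has no else branch, 
so it is not guaranteed to be evaluated for a given row and should not be hoisted 
as a common subexpression. All names in the snippet are made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CaseWhenNoElse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("casewhen-no-else").getOrCreate()
    import spark.implicits._

    // Stand-in for an expensive expression that appears in every value branch.
    val expensive = udf((s: String) => { Thread.sleep(1); s.length })

    val df = spark.range(5).select($"id".cast("string").as("id"))

    // No .otherwise(...): elseValue is empty, so no value branch is guaranteed
    // to be evaluated for a given row.
    df.select(
      when(length($"id") > 1, expensive($"id") + 1)
        .when(length($"id") === 1, expensive($"id") + 2)
        .as("out")
    ).show()

    spark.stop()
  }
}
{code}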






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins

2021-05-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347871#comment-17347871
 ] 

Dongjoon Hyun commented on SPARK-32304:
---

Oh, got it. Thank you, [~andygrove]!

> Flaky Test: AdaptiveQueryExecSuite.multiple joins
> -
>
> Key: SPARK-32304
> URL: https://issues.apache.org/jira/browse/SPARK-32304
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> AdaptiveQueryExecSuite:
> - Change merge join to broadcast join (313 milliseconds)
> - Reuse the parallelism of CoalescedShuffleReaderExec in 
> LocalShuffleReaderExec (265 milliseconds)
> - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds)
> - Empty stage coalesced to 0-partition RDD (514 milliseconds)
> - Scalar subquery (406 milliseconds)
> - Scalar subquery in later stages (500 milliseconds)
> - multiple joins *** FAILED *** (739 milliseconds)
>   ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft
>   :- BroadcastQueryStage 5
>   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, 
> false] as bigint))), [id=#504817]
>   : +- CustomShuffleReader local
>   :+- ShuffleQueryStage 4
>   :   +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777]
>   :  +- *(7) SortMergeJoin [key#251418], [a#251428], Inner
>   : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0
>   : :  +- CustomShuffleReader coalesced
>   : : +- ShuffleQueryStage 0
>   : :+- Exchange hashpartitioning(key#251418, 5), 
> true, [id=#504656]
>   : :   +- *(1) Filter (isnotnull(value#251419) AND 
> (cast(value#251419 as int) = 1))
>   : :  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) 
> AS value#251419]
>   : : +- Scan[obj#251417]
>   : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0
>   :+- CustomShuffleReader coalesced
>   :   +- ShuffleQueryStage 1
>   :  +- Exchange hashpartitioning(a#251428, 5), true, 
> [id=#504663]
>   : +- *(2) Filter (b#251429 = 1)
>   :+- *(2) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, 
> knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429]
>   :   +- Scan[obj#251427]
>   +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>  :- CustomShuffleReader local
>  :  +- ShuffleQueryStage 2
>  : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>  :+- *(3) Filter (n#251498 = 1)
>  :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) 
> AS l#251499]
>  :  +- Scan[obj#251497]
>  +- BroadcastQueryStage 6
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#504830]
>+- CustomShuffleReader local
>   +- ShuffleQueryStage 3
>  +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694]
> +- *(4) Filter (a#251438 = 1)
>+- *(4) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, 
> unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439]
>   +- Scan[obj#251437]
>   , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>   :- CustomShuffleReader local
>   :  +- ShuffleQueryStage 2
>   : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>   :+- *(3) Filter (n#251498 = 1)
>   :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n

[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins

2021-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32304:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> Flaky Test: AdaptiveQueryExecSuite.multiple joins
> -
>
> Key: SPARK-32304
> URL: https://issues.apache.org/jira/browse/SPARK-32304
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> AdaptiveQueryExecSuite:
> - Change merge join to broadcast join (313 milliseconds)
> - Reuse the parallelism of CoalescedShuffleReaderExec in 
> LocalShuffleReaderExec (265 milliseconds)
> - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds)
> - Empty stage coalesced to 0-partition RDD (514 milliseconds)
> - Scalar subquery (406 milliseconds)
> - Scalar subquery in later stages (500 milliseconds)
> - multiple joins *** FAILED *** (739 milliseconds)
>   ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft
>   :- BroadcastQueryStage 5
>   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, 
> false] as bigint))), [id=#504817]
>   : +- CustomShuffleReader local
>   :+- ShuffleQueryStage 4
>   :   +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777]
>   :  +- *(7) SortMergeJoin [key#251418], [a#251428], Inner
>   : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0
>   : :  +- CustomShuffleReader coalesced
>   : : +- ShuffleQueryStage 0
>   : :+- Exchange hashpartitioning(key#251418, 5), 
> true, [id=#504656]
>   : :   +- *(1) Filter (isnotnull(value#251419) AND 
> (cast(value#251419 as int) = 1))
>   : :  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) 
> AS value#251419]
>   : : +- Scan[obj#251417]
>   : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0
>   :+- CustomShuffleReader coalesced
>   :   +- ShuffleQueryStage 1
>   :  +- Exchange hashpartitioning(a#251428, 5), true, 
> [id=#504663]
>   : +- *(2) Filter (b#251429 = 1)
>   :+- *(2) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, 
> knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429]
>   :   +- Scan[obj#251427]
>   +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>  :- CustomShuffleReader local
>  :  +- ShuffleQueryStage 2
>  : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>  :+- *(3) Filter (n#251498 = 1)
>  :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) 
> AS l#251499]
>  :  +- Scan[obj#251497]
>  +- BroadcastQueryStage 6
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#504830]
>+- CustomShuffleReader local
>   +- ShuffleQueryStage 3
>  +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694]
> +- *(4) Filter (a#251438 = 1)
>+- *(4) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, 
> unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439]
>   +- Scan[obj#251437]
>   , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>   :- CustomShuffleReader local
>   :  +- ShuffleQueryStage 2
>   : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>   :+- *(3) Filter (n#251498 = 1)
>   :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(

[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins

2021-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32304:
--
Parent: (was: SPARK-32244)
Issue Type: Bug  (was: Sub-task)

> Flaky Test: AdaptiveQueryExecSuite.multiple joins
> -
>
> Key: SPARK-32304
> URL: https://issues.apache.org/jira/browse/SPARK-32304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> AdaptiveQueryExecSuite:
> - Change merge join to broadcast join (313 milliseconds)
> - Reuse the parallelism of CoalescedShuffleReaderExec in 
> LocalShuffleReaderExec (265 milliseconds)
> - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds)
> - Empty stage coalesced to 0-partition RDD (514 milliseconds)
> - Scalar subquery (406 milliseconds)
> - Scalar subquery in later stages (500 milliseconds)
> - multiple joins *** FAILED *** (739 milliseconds)
>   ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft
>   :- BroadcastQueryStage 5
>   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, 
> false] as bigint))), [id=#504817]
>   : +- CustomShuffleReader local
>   :+- ShuffleQueryStage 4
>   :   +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777]
>   :  +- *(7) SortMergeJoin [key#251418], [a#251428], Inner
>   : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0
>   : :  +- CustomShuffleReader coalesced
>   : : +- ShuffleQueryStage 0
>   : :+- Exchange hashpartitioning(key#251418, 5), 
> true, [id=#504656]
>   : :   +- *(1) Filter (isnotnull(value#251419) AND 
> (cast(value#251419 as int) = 1))
>   : :  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) 
> AS value#251419]
>   : : +- Scan[obj#251417]
>   : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0
>   :+- CustomShuffleReader coalesced
>   :   +- ShuffleQueryStage 1
>   :  +- Exchange hashpartitioning(a#251428, 5), true, 
> [id=#504663]
>   : +- *(2) Filter (b#251429 = 1)
>   :+- *(2) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, 
> knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429]
>   :   +- Scan[obj#251427]
>   +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>  :- CustomShuffleReader local
>  :  +- ShuffleQueryStage 2
>  : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>  :+- *(3) Filter (n#251498 = 1)
>  :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) 
> AS l#251499]
>  :  +- Scan[obj#251497]
>  +- BroadcastQueryStage 6
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#504830]
>+- CustomShuffleReader local
>   +- ShuffleQueryStage 3
>  +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694]
> +- *(4) Filter (a#251438 = 1)
>+- *(4) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, 
> unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439]
>   +- Scan[obj#251437]
>   , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>   :- CustomShuffleReader local
>   :  +- ShuffleQueryStage 2
>   : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>   :+- *(3) Filter (n#251498 = 1)
>   :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> statici

[jira] [Reopened] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins

2021-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-32304:
---

I reopened this and converted it to a subtask of SPARK-33828.
If there are no more occurrences during the Apache Spark 3.2.0 QA period, we can 
close this.

> Flaky Test: AdaptiveQueryExecSuite.multiple joins
> -
>
> Key: SPARK-32304
> URL: https://issues.apache.org/jira/browse/SPARK-32304
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> AdaptiveQueryExecSuite:
> - Change merge join to broadcast join (313 milliseconds)
> - Reuse the parallelism of CoalescedShuffleReaderExec in 
> LocalShuffleReaderExec (265 milliseconds)
> - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds)
> - Empty stage coalesced to 0-partition RDD (514 milliseconds)
> - Scalar subquery (406 milliseconds)
> - Scalar subquery in later stages (500 milliseconds)
> - multiple joins *** FAILED *** (739 milliseconds)
>   ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft
>   :- BroadcastQueryStage 5
>   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, 
> false] as bigint))), [id=#504817]
>   : +- CustomShuffleReader local
>   :+- ShuffleQueryStage 4
>   :   +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777]
>   :  +- *(7) SortMergeJoin [key#251418], [a#251428], Inner
>   : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0
>   : :  +- CustomShuffleReader coalesced
>   : : +- ShuffleQueryStage 0
>   : :+- Exchange hashpartitioning(key#251418, 5), 
> true, [id=#504656]
>   : :   +- *(1) Filter (isnotnull(value#251419) AND 
> (cast(value#251419 as int) = 1))
>   : :  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) 
> AS value#251419]
>   : : +- Scan[obj#251417]
>   : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0
>   :+- CustomShuffleReader coalesced
>   :   +- ShuffleQueryStage 1
>   :  +- Exchange hashpartitioning(a#251428, 5), true, 
> [id=#504663]
>   : +- *(2) Filter (b#251429 = 1)
>   :+- *(2) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, 
> knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429]
>   :   +- Scan[obj#251427]
>   +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>  :- CustomShuffleReader local
>  :  +- ShuffleQueryStage 2
>  : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>  :+- *(3) Filter (n#251498 = 1)
>  :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) 
> AS l#251499]
>  :  +- Scan[obj#251497]
>  +- BroadcastQueryStage 6
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#504830]
>+- CustomShuffleReader local
>   +- ShuffleQueryStage 3
>  +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694]
> +- *(4) Filter (a#251438 = 1)
>+- *(4) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, 
> unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439]
>   +- Scan[obj#251437]
>   , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>   :- CustomShuffleReader local
>   :  +- ShuffleQueryStage 2
>   : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>   :+- *(3) Filter (n#251498 = 1)
>   :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spar

[jira] [Resolved] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35338.
---
Fix Version/s: 3.2.0
 Assignee: Xinrong Meng
   Resolution: Fixed

Issue resolved by pull request 32469
https://github.com/apache/spark/pull/32469

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35256) Subexpression elimination leading to a performance regression

2021-05-19 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347974#comment-17347974
 ] 

Adam Binford commented on SPARK-35256:
--

See https://issues.apache.org/jira/browse/SPARK-35448 for a few related 
subexpression elimination issues being worked on. Specifically, 
https://issues.apache.org/jira/browse/SPARK-35410 is probably the main issue 
you're facing, where `when` expressions can cause a lot of extra, unused 
subexpressions to be included in the codegen.

> Subexpression elimination leading to a performance regression
> -
>
> Key: SPARK-35256
> URL: https://issues.apache.org/jira/browse/SPARK-35256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ondrej Kokes
>Priority: Minor
> Attachments: bisect_log.txt, bisect_timing.csv
>
>
> I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline 
> that does mostly str_to_map, split and a few other operations - all 
> projections, no joins or aggregations (it's here only to trigger the 
> pipeline). I cut it down to the simplest reproducible example I could - 
> anything I remove from this changes the runtime difference quite 
> dramatically. (even moving those two expressions from f.when to standalone 
> columns makes the difference disappear)
> {code:java}
> import time
> import os
> 
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> 
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
> 
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar&baz=bak&bar=f,o,1:2:3\n')
> 
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>         .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>         .withColumn('extracted',
>             # without this top level split it is only 50% slower,
>             # with it the runtime almost doubles
>             f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0]
>         )
>         .select(
>             f.when(
>                 f.col("extracted").startswith("foo"), f.col("extracted")
>             ).otherwise(
>                 f.concat(f.lit("foo"), f.col("extracted"))
>             ).alias("foo")
>         )
>     )
>     # dd.explain(True)
>     _ = dd.groupby("foo").count().count()
>     print("elapsed", time.time() - t)
> {code}
> Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my 
> local macOS)
> {code:java}
> 3.0.1
> elapsed 21.262351036071777
> 3.1.1
> elapsed 40.26582884788513
> {code}
> (Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1)
> Feel free to make the CSV smaller to get a quicker feedback loop - it scales 
> linearly (I developed this with 2M rows).
> It might be related to my previous issue - SPARK-32989 - there are similar 
> operations, nesting etc. (splitting on the original column, not on a map, 
> makes the difference disappear)
> I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and 
> 3.1.1 produced identical plans.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35448) Subexpression elimination enhancements

2021-05-19 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347976#comment-17347976
 ] 

Adam Binford commented on SPARK-35448:
--

Just spitballing after looking through some of the subexpression elimination 
code. Does it make more sense to treat the output of EquivalentExpressions as 
"candidates" for subexpressions, and then it's the codegen's job to figure out 
which subexpressions are actually needed and used and include the code for them?

> Subexpression elimination enhancements
> --
>
> Key: SPARK-35448
> URL: https://issues.apache.org/jira/browse/SPARK-35448
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35448) Subexpression elimination enhancements

2021-05-19 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347976#comment-17347976
 ] 

Adam Binford edited comment on SPARK-35448 at 5/19/21, 11:21 PM:
-

Just spitballing after looking through some of the subexpression elimination 
code. Does it make any sense to treat the output of EquivalentExpressions as 
"candidates" for subexpressions, and then it's the codegen's job to figure out 
which subexpressions are actually needed and used and include the code for them?


was (Author: kimahriman):
Just spitballing after looking through some of the subexpression elimination 
code. Does it make more sense to treat the output of EquivalentExpressions as 
"candidates" for subexpressions, and then it's the codegen's job to figure out 
which subexpressions are actually needed and used and include the code for them?

> Subexpression elimination enhancements
> --
>
> Key: SPARK-35448
> URL: https://issues.apache.org/jira/browse/SPARK-35448
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35438) Minor documentation fix for window physical operator

2021-05-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35438.
--
Fix Version/s: 3.2.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32585

> Minor documentation fix for window physical operator
> 
>
> Key: SPARK-35438
> URL: https://issues.apache.org/jira/browse/SPARK-35438
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.2.0
>
>
> As the title says. Fixed two places where the documentation had errors, to help 
> people read the code more easily in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reopened SPARK-35338:
---

The PR was reverted at 
https://github.com/apache/spark/commit/d44e6c7f10528556dc5a64527e6f67e2ae7947fc 
due to a Python linter issue.

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35338:


Assignee: Apache Spark  (was: Xinrong Meng)

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-35338:
--
Fix Version/s: (was: 3.2.0)

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35338:


Assignee: Xinrong Meng  (was: Apache Spark)

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35449:


Assignee: Apache Spark

> Should not extract common expressions from value expressions when elseValue 
> is empty in CaseWhen
> 
>
> Key: SPARK-35449
> URL: https://issues.apache.org/jira/browse/SPARK-35449
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347986#comment-17347986
 ] 

Apache Spark commented on SPARK-35449:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/32595

> Should not extract common expressions from value expressions when elseValue 
> is empty in CaseWhen
> 
>
> Key: SPARK-35449
> URL: https://issues.apache.org/jira/browse/SPARK-35449
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35449:


Assignee: (was: Apache Spark)

> Should not extract common expressions from value expressions when elseValue 
> is empty in CaseWhen
> 
>
> Key: SPARK-35449
> URL: https://issues.apache.org/jira/browse/SPARK-35449
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347988#comment-17347988
 ] 

Apache Spark commented on SPARK-35449:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/32595

> Should not extract common expressions from value expressions when elseValue 
> is empty in CaseWhen
> 
>
> Key: SPARK-35449
> URL: https://issues.apache.org/jira/browse/SPARK-35449
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347991#comment-17347991
 ] 

Apache Spark commented on SPARK-35338:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32596

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347992#comment-17347992
 ] 

Apache Spark commented on SPARK-35338:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32596

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35446) Override getJDBCType in MySQLDialect to map FloatType to FLOAT

2021-05-19 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347993#comment-17347993
 ] 

Takeshi Yamamuro commented on SPARK-35446:
--

Oh, I was a bit surprised that MySQL treats REAL as DOUBLE... Anyway, the 
proposal sounds good. Do you want to open a PR to fix this?

> Override getJDBCType in MySQLDialect to map FloatType to FLOAT
> --
>
> Key: SPARK-35446
> URL: https://issues.apache.org/jira/browse/SPARK-35446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Marios Meimaris
>Priority: Major
>
> In 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L165,]
>  FloatType is mapped to REAL. However, MySQL treats REAL as a synonym for 
> DOUBLE by default (see 
> [https://dev.mysql.com/doc/refman/8.0/en/numeric-types.html] for more 
> information). 
> `MySQLDialect` could override `getJDBCType` so that it maps FloatType to 
> FLOAT instead.
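
A minimal sketch of what the proposed mapping could look like, written here as a 
user-registered dialect purely for illustration (the actual change would go into 
Spark's built-in MySQLDialect; the object name below is made up):

{code:scala}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, FloatType}

// Hypothetical dialect that only overrides the FloatType mapping.
object MySQLFloatDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:mysql")

  // Map Spark's FloatType to MySQL FLOAT instead of the default REAL,
  // which MySQL treats as DOUBLE unless REAL_AS_FLOAT is set.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case FloatType => Some(JdbcType("FLOAT", Types.FLOAT))
    case _         => None  // defer to Spark's default mappings
  }
}

// Registering it (e.g. in spark-shell) makes JDBC writes against
// jdbc:mysql URLs pick up the FLOAT mapping.
JdbcDialects.registerDialect(MySQLFloatDialect)
{code}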



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.

2021-05-19 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35450:
-

 Summary: Follow checkout-merge way to use the latest commit for 
linter, or other workflows.
 Key: SPARK-35450
 URL: https://issues.apache.org/jira/browse/SPARK-35450
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Takuya Ueshin


For the linter and other workflows besides build-and-tests, we should follow the 
checkout-merge approach to use the latest commit; otherwise, those workflows could 
run against old settings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35451) Sync with Apache Spark branch for all builds such as linter in GitHub Actions

2021-05-19 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35451:


 Summary: Sync with Apache Spark branch for all builds such as 
linter in GitHub Actions
 Key: SPARK-35451
 URL: https://issues.apache.org/jira/browse/SPARK-35451
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


We currently just do a simple checkout at 
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L471-L472
 for some build-only and linter jobs.

This can easily cause a logical conflict when the PR is not synced 
against the latest changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35451) Sync with Apache Spark branch for all builds such as linter in GitHub Actions

2021-05-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35451.
--
Resolution: Duplicate

> Sync with Apache Spark branch for all builds such as linter in GitHub Actions
> -
>
> Key: SPARK-35451
> URL: https://issues.apache.org/jira/browse/SPARK-35451
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We currently just do a simple checkout at 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L471-L472
>  for some build-only and linter jobs.
> This can easily cause a logical conflict when the PR is not synced 
> against the latest changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35450:


Assignee: Apache Spark

> Follow checkout-merge way to use the latest commit for linter, or other 
> workflows.
> --
>
> Key: SPARK-35450
> URL: https://issues.apache.org/jira/browse/SPARK-35450
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> For the linter and other workflows besides build-and-tests, we should follow the 
> checkout-merge approach to use the latest commit; otherwise, those workflows 
> could run against old settings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35450:


Assignee: (was: Apache Spark)

> Follow checkout-merge way to use the latest commit for linter, or other 
> workflows.
> --
>
> Key: SPARK-35450
> URL: https://issues.apache.org/jira/browse/SPARK-35450
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> For the linter and other workflows besides build-and-tests, we should follow the 
> checkout-merge approach to use the latest commit; otherwise, those workflows 
> could run against old settings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347997#comment-17347997
 ] 

Apache Spark commented on SPARK-35450:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32597

> Follow checkout-merge way to use the latest commit for linter, or other 
> workflows.
> --
>
> Key: SPARK-35450
> URL: https://issues.apache.org/jira/browse/SPARK-35450
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> For the linter and other workflows besides build-and-tests, we should follow the 
> checkout-merge approach to use the latest commit; otherwise, those workflows 
> could run against old settings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35408) Improve parameter validation in DataFrame.show

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348000#comment-17348000
 ] 

Apache Spark commented on SPARK-35408:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32598

> Improve parameter validation in DataFrame.show
> --
>
> Key: SPARK-35408
> URL: https://issues.apache.org/jira/browse/SPARK-35408
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Fix For: 3.2.0
>
>
> Being more used to the Scala API, the user may be tempted to call
> {code:python}
> df.show(False)
> {code}
> in PySpark, and she will receive an error message that does not easily map to 
> the user code:
> {noformat}
> py4j.Py4JException: Method showString([class java.lang.Boolean, class 
> java.lang.Integer, class java.lang.Boolean]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35320) from_json cannot parse maps with timestamp as key

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35320:


Assignee: Apache Spark

> from_json cannot parse maps with timestamp as key
> -
>
> Key: SPARK-35320
> URL: https://issues.apache.org/jira/browse/SPARK-35320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.1
> Environment: * Java 11
>  * Spark 3.0.1/3.1.1
>  * Scala 2.12
>Reporter: Vincenzo Cerminara
>Assignee: Apache Spark
>Priority: Minor
>
> I have a json that contains a {{map}} like the following
> {code:json}
> {
>   "map": {
> "2021-05-05T20:05:08": "sampleValue"
>   }
> }
> {code}
> The key of the map is a string containing a formatted timestamp and I want to 
> parse it as a Java Map using the from_json 
> Spark SQL function (see the {{Sample}} class in the code below).
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import java.io.Serializable;
> import java.time.Instant;
> import java.util.List;
> import java.util.Map;
> import static org.apache.spark.sql.functions.*;
> public class TimestampAsJsonMapKey {
> public static class Sample implements Serializable {
> private Map map;
> 
> public Map getMap() {
> return map;
> }
> 
> public void setMap(Map map) {
> this.map = map;
> }
> }
> public static class InvertedSample implements Serializable {
> private Map map;
> 
> public Map getMap() {
> return map;
> }
> 
> public void setMap(Map map) {
> this.map = map;
> }
> }
> public static void main(String[] args) {
> final SparkSession spark = SparkSession
> .builder()
> .appName("Timestamp As Json Map Key Test")
> .master("local[1]")
> .getOrCreate();
> workingTest(spark);
> notWorkingTest(spark);
> }
> private static void workingTest(SparkSession spark) {
> //language=JSON
> final String invertedSampleJson = "{ \"map\": { \"sampleValue\": 
> \"2021-05-05T20:05:08\" } }";
> final Dataset<String> samplesDf = 
> spark.createDataset(List.of(invertedSampleJson), Encoders.STRING());
> final Dataset<Row> parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(InvertedSample.class).schema()));
> parsedDf.show(false);
> }
> private static void notWorkingTest(SparkSession spark) {
> //language=JSON
> final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": 
> \"sampleValue\" } }";
> final Dataset<String> samplesDf = 
> spark.createDataset(List.of(sampleJson), Encoders.STRING());
> final Dataset<Row> parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(Sample.class).schema()));
> parsedDf.show(false);
> }
> }
> {code}
> When I run the {{notWorkingTest}} method it fails with the following 
> exception:
> {noformat}
> Exception in thread "main" java.lang.ClassCastException: class 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to class 
> java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module 
> of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
>   at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
>   at 
> org.apache.spark.sql.catalys

[jira] [Assigned] (SPARK-35320) from_json cannot parse maps with timestamp as key

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35320:


Assignee: (was: Apache Spark)

> from_json cannot parse maps with timestamp as key
> -
>
> Key: SPARK-35320
> URL: https://issues.apache.org/jira/browse/SPARK-35320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.1
> Environment: * Java 11
>  * Spark 3.0.1/3.1.1
>  * Scala 2.12
>Reporter: Vincenzo Cerminara
>Priority: Minor
>
> I have a json that contains a {{map}} like the following
> {code:json}
> {
>   "map": {
> "2021-05-05T20:05:08": "sampleValue"
>   }
> }
> {code}
> The key of the map is a string containing a formatted timestamp and I want to 
> parse it as a Java Map using the from_json 
> Spark SQL function (see the {{Sample}} class in the code below).
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import java.io.Serializable;
> import java.time.Instant;
> import java.util.List;
> import java.util.Map;
> import static org.apache.spark.sql.functions.*;
> public class TimestampAsJsonMapKey {
> public static class Sample implements Serializable {
> private Map<Instant, String> map;
> 
> public Map<Instant, String> getMap() {
> return map;
> }
> 
> public void setMap(Map<Instant, String> map) {
> this.map = map;
> }
> }
> public static class InvertedSample implements Serializable {
> private Map<String, Instant> map;
> 
> public Map<String, Instant> getMap() {
> return map;
> }
> 
> public void setMap(Map<String, Instant> map) {
> this.map = map;
> }
> }
> public static void main(String[] args) {
> final SparkSession spark = SparkSession
> .builder()
> .appName("Timestamp As Json Map Key Test")
> .master("local[1]")
> .getOrCreate();
> workingTest(spark);
> notWorkingTest(spark);
> }
> private static void workingTest(SparkSession spark) {
> //language=JSON
> final String invertedSampleJson = "{ \"map\": { \"sampleValue\": 
> \"2021-05-05T20:05:08\" } }";
> final Dataset<String> samplesDf = 
> spark.createDataset(List.of(invertedSampleJson), Encoders.STRING());
> final Dataset<Row> parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(InvertedSample.class).schema()));
> parsedDf.show(false);
> }
> private static void notWorkingTest(SparkSession spark) {
> //language=JSON
> final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": 
> \"sampleValue\" } }";
> final Dataset<String> samplesDf = 
> spark.createDataset(List.of(sampleJson), Encoders.STRING());
> final Dataset<Row> parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(Sample.class).schema()));
> parsedDf.show(false);
> }
> }
> {code}
> When I run the {{notWorkingTest}} method it fails with the following 
> exception:
> {noformat}
> Exception in thread "main" java.lang.ClassCastException: class 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to class 
> java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module 
> of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
>   at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Interpreted

[jira] [Assigned] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.

2021-05-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35450:


Assignee: Takuya Ueshin

> Follow checkout-merge way to use the latest commit for linter, or other 
> workflows.
> --
>
> Key: SPARK-35450
> URL: https://issues.apache.org/jira/browse/SPARK-35450
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> For the linter and other workflows besides build-and-tests, we should follow the 
> checkout-merge approach to use the latest commit; otherwise, they could run on 
> the old settings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35320) from_json cannot parse maps with timestamp as key

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348002#comment-17348002
 ] 

Apache Spark commented on SPARK-35320:
--

User 'planga82' has created a pull request for this issue:
https://github.com/apache/spark/pull/32599

> from_json cannot parse maps with timestamp as key
> -
>
> Key: SPARK-35320
> URL: https://issues.apache.org/jira/browse/SPARK-35320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.1
> Environment: * Java 11
>  * Spark 3.0.1/3.1.1
>  * Scala 2.12
>Reporter: Vincenzo Cerminara
>Priority: Minor
>
> I have a json that contains a {{map}} like the following
> {code:json}
> {
>   "map": {
> "2021-05-05T20:05:08": "sampleValue"
>   }
> }
> {code}
> The key of the map is a string containing a formatted timestamp and I want to 
> parse it as a Java Map using the from_json 
> Spark SQL function (see the {{Sample}} class in the code below).
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import java.io.Serializable;
> import java.time.Instant;
> import java.util.List;
> import java.util.Map;
> import static org.apache.spark.sql.functions.*;
> public class TimestampAsJsonMapKey {
> public static class Sample implements Serializable {
> private Map<Instant, String> map;
> 
> public Map<Instant, String> getMap() {
> return map;
> }
> 
> public void setMap(Map<Instant, String> map) {
> this.map = map;
> }
> }
> public static class InvertedSample implements Serializable {
> private Map<String, Instant> map;
> 
> public Map<String, Instant> getMap() {
> return map;
> }
> 
> public void setMap(Map<String, Instant> map) {
> this.map = map;
> }
> }
> public static void main(String[] args) {
> final SparkSession spark = SparkSession
> .builder()
> .appName("Timestamp As Json Map Key Test")
> .master("local[1]")
> .getOrCreate();
> workingTest(spark);
> notWorkingTest(spark);
> }
> private static void workingTest(SparkSession spark) {
> //language=JSON
> final String invertedSampleJson = "{ \"map\": { \"sampleValue\": 
> \"2021-05-05T20:05:08\" } }";
> final Dataset<String> samplesDf = 
> spark.createDataset(List.of(invertedSampleJson), Encoders.STRING());
> final Dataset<Row> parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(InvertedSample.class).schema()));
> parsedDf.show(false);
> }
> private static void notWorkingTest(SparkSession spark) {
> //language=JSON
> final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": 
> \"sampleValue\" } }";
> final Dataset<String> samplesDf = 
> spark.createDataset(List.of(sampleJson), Encoders.STRING());
> final Dataset<Row> parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(Sample.class).schema()));
> parsedDf.show(false);
> }
> }
> {code}
> When I run the {{notWorkingTest}} method it fails with the following 
> exception:
> {noformat}
> Exception in thread "main" java.lang.ClassCastException: class 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to class 
> java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module 
> of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
>   at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.exp

[jira] [Resolved] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.

2021-05-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35450.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32597
[https://github.com/apache/spark/pull/32597]

> Follow checkout-merge way to use the latest commit for linter, or other 
> workflows.
> --
>
> Key: SPARK-35450
> URL: https://issues.apache.org/jira/browse/SPARK-35450
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> For the linter and other workflows besides build-and-tests, we should follow the 
> checkout-merge approach to use the latest commit; otherwise, they could run on 
> the old settings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35452) Introduce ComplexTypeOps

2021-05-19 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35452:


 Summary: Introduce ComplexTypeOps
 Key: SPARK-35452
 URL: https://issues.apache.org/jira/browse/SPARK-35452
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


StructType, ArrayType, MapType, and BinaryType are not currently accepted by 
DataTypeOps.

We should handle these complex types.

Please note that arithmetic operations might be applied to these complex types, 
for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same 
as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]).
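
A small illustration of the pandas behaviour referred to above; the pandas-on-Spark line is left commented out since matching it is exactly what this ticket would enable:

{code:python}
import pandas as pd

# pandas applies + element-wise, so the lists are concatenated row by row:
s = pd.Series([[1, 2, 3]]) + pd.Series([[4, 5, 6]])
print(s[0])  # [1, 2, 3, 4, 5, 6]

# Goal: the pandas-on-Spark equivalent should behave the same way.
# import pyspark.pandas as ps
# ps.Series([[1, 2, 3]]) + ps.Series([[4, 5, 6]])
{code}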



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35453) Move Koalas accessor to pandas_on_spark accessor

2021-05-19 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-35453:
---

 Summary: Move Koalas accessor to pandas_on_spark accessor
 Key: SPARK-35453
 URL: https://issues.apache.org/jira/browse/SPARK-35453
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Haejoon Lee


The existing Koalas has the "Koalas accessor", which is named after the Koalas project.

 

We should rename this accessor to "Pandas on Spark accessor".
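
As a rough before/after sketch of what the rename would mean for user code (the accessor method shown, transform_batch, is only an example, and the new accessor name is assumed from the summary above):

{code:python}
import pyspark.pandas as ps  # assumes the pandas-on-Spark port is available

psdf = ps.DataFrame({"a": [1, 2, 3]})

# Before: accessor named after the Koalas project.
# psdf.koalas.transform_batch(lambda pdf: pdf + 1)

# After this change: accessor named for pandas on Spark.
# psdf.pandas_on_spark.transform_batch(lambda pdf: pdf + 1)
{code}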



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35453) Move Koalas accessor to pandas_on_spark accessor

2021-05-19 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348021#comment-17348021
 ] 

Haejoon Lee commented on SPARK-35453:
-

I'm working on this

> Move Koalas accessor to pandas_on_spark accessor
> 
>
> Key: SPARK-35453
> URL: https://issues.apache.org/jira/browse/SPARK-35453
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The existing Koalas has the "Koalas accessor", which is named after the Koalas 
> project.
>  
> We should rename this accessor to "Pandas on Spark accessor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35338) Separate arithmetic operations into data type based structures

2021-05-19 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35338.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32596
https://github.com/apache/spark/pull/32596

> Separate arithmetic operations into data type based structures
> --
>
> Key: SPARK-35338
> URL: https://issues.apache.org/jira/browse/SPARK-35338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> For arithmetic operations:
> {code:java}
> __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
> __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__
> {code}
> we would like to separate them into data-type-based structures.
> The existing behaviors of each arithmetic operation should be preserved.
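
A very rough sketch of what a data-type-based split of these operators could look like; the names used here (DataTypeOps, NumericOps, StringOps, ops_for) are purely illustrative and not the actual implementation:

{code:python}
# Illustrative sketch only: dispatch arithmetic to a per-data-type ops object.

class DataTypeOps:
    """Base class; operators unsupported for the data type raise TypeError."""
    def add(self, left, right):
        raise TypeError("addition is not supported for this data type")


class NumericOps(DataTypeOps):
    def add(self, left, right):
        return left + right  # plain numeric addition


class StringOps(DataTypeOps):
    def add(self, left, right):
        return str(left) + str(right)  # string concatenation semantics


def ops_for(value) -> DataTypeOps:
    # The real code would key off the Spark DataType of the column instead.
    return NumericOps() if isinstance(value, (int, float)) else StringOps()


# A Series-like __add__ would then delegate to the chosen ops object:
print(ops_for(1).add(1, 2))          # 3
print(ops_for("a").add("a", "b"))    # ab
{code}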



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33428) conv UDF returns incorrect value

2021-05-19 Thread Nguyen Quang Huy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348024#comment-17348024
 ] 

Nguyen Quang Huy commented on SPARK-33428:
--

Should we consider returning a null value if overflow occurs, since BigInt can't 
handle signed and unsigned bases?

> conv UDF returns incorrect value
> 
>
> Key: SPARK-33428
> URL: https://issues.apache.org/jira/browse/SPARK-33428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {noformat}
> spark-sql> select java_method('scala.math.BigInt', 'apply', 
> 'c8dcdfb41711fc9a1f17928001d7fd61', 16);
> 266992441711411603393340504520074460513
> spark-sql> select conv('c8dcdfb41711fc9a1f17928001d7fd61', 16, 10);
> 18446744073709551615
> {noformat}
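
For comparison, Python's arbitrary-precision integers give the mathematically correct value, and the value conv returns is exactly the unsigned 64-bit maximum, which is what makes the silent clamping easy to miss:

{code:python}
# The hex string from the report, converted with arbitrary precision:
v = int("c8dcdfb41711fc9a1f17928001d7fd61", 16)
print(v)            # 266992441711411603393340504520074460513

# conv(..., 16, 10) instead returns the unsigned 64-bit maximum:
print(2 ** 64 - 1)  # 18446744073709551615
{code}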



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35454) Ambiguous self-join doesn't fail after transforming the dataset to dataframe

2021-05-19 Thread wuyi (Jira)
wuyi created SPARK-35454:


 Summary: Ambiguous self-join doesn't fail after transforming the 
dataset to dataframe
 Key: SPARK-35454
 URL: https://issues.apache.org/jira/browse/SPARK-35454
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.1.0
Reporter: wuyi


{code:java}
test("SPARK-28344: fail ambiguous self join - Dataset.colRegex as column ref") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
assertAmbiguousSelfJoin(df1.join(df2, df1.colRegex("id") > 
df2.colRegex("id")))
  }
}
{code}
For this unit test, if we append `.toDF()` to both df1 and df2, the query won't 
fail. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-05-19 Thread Yuzhou Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348033#comment-17348033
 ] 

Yuzhou Sun commented on SPARK-35106:


Thanks Erik and Wenchen for initiating and reviewing the fix!

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   # BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   # BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   # BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35455) Enhance EliminateUnnecessaryJoin

2021-05-19 Thread XiDuo You (Jira)
XiDuo You created SPARK-35455:
-

 Summary: Enhance EliminateUnnecessaryJoin
 Key: SPARK-35455
 URL: https://issues.apache.org/jira/browse/SPARK-35455
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: XiDuo You


Make EliminateUnnecessaryJoin support eliminating outer joins and multi-joins.
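
As an assumed illustration of the kind of query such a rule could eventually cover: a left outer join that selects only left-side columns and joins on distinct right-side keys neither filters nor duplicates rows, so it is equivalent to scanning the left side alone (whether the rule ends up covering exactly this shape is not confirmed here):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.range(5).createOrReplaceTempView("t1")
spark.range(3).createOrReplaceTempView("t2")

# Only t1's columns are selected, the join is LEFT OUTER, and the right-side
# keys are made distinct, so the result is identical to scanning t1 alone.
spark.sql("""
    SELECT t1.id
    FROM t1
    LEFT JOIN (SELECT DISTINCT id FROM t2) d ON t1.id = d.id
""").explain()
{code}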



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35456) Show invalid value in config entry check error message

2021-05-19 Thread Kent Yao (Jira)
Kent Yao created SPARK-35456:


 Summary: Show invalid value in config entry check error message
 Key: SPARK-35456
 URL: https://issues.apache.org/jira/browse/SPARK-35456
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35443:
-

Assignee: Ashray Jain

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Ashray Jain
>Assignee: Ashray Jain
>Priority: Minor
>  Labels: kubernetes
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35443.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32588
[https://github.com/apache/spark/pull/32588]

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Ashray Jain
>Assignee: Ashray Jain
>Priority: Minor
>  Labels: kubernetes
> Fix For: 3.2.0
>
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2021-05-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27991.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32287
[https://github.com/apache/spark/pull/32287]

> ShuffleBlockFetcherIterator should take Netty constant-factor overheads into 
> account when limiting number of simultaneous block fetches
> ---
>
> Key: SPARK-27991
> URL: https://issues.apache.org/jira/browse/SPARK-27991
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
> Fix For: 3.2.0
>
>
> ShuffleBlockFetcherIterator has logic to limit the number of simultaneous 
> block fetches. By default, this logic tries to keep the number of outstanding 
> block fetches [beneath a data size 
> limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
>  ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
> into account: even though a remote block might be, say, 4KB, there are 
> certain fixed-size internal overheads due to Netty buffer sizes which may 
> cause the actual space requirements to be larger.
> As a result, if a map stage produces a huge number of extremely tiny blocks 
> then we may see errors like
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
> byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
> [...]
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
> 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
> at 
> io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
> at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
> [...]{code}
> SPARK-24989 is another report of this problem (but with a different proposed 
> fix).
> This problem can currently be mitigated by setting 
> {{spark.reducer.maxReqsInFlight}} to some non-IntMax value (SPARK-6166), 
> but this additional manual configuration step is cumbersome.
> Instead, I think that Spark should take these fixed overheads into account in 
> the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, 
> use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some 
> tricky details involved to make this work on all configurations (e.g. to use 
> a different minimum when direct buffers are disabled, etc.), but I think the 
> core idea behind the fix is pretty simple.
> This will improve Spark's stability and remove configuration / tuning burden 
> from end users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35443:
--
Labels:   (was: kubernetes)

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Ashray Jain
>Assignee: Ashray Jain
>Priority: Minor
> Fix For: 3.2.0
>
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35443) Mark K8s secrets and config maps as immutable

2021-05-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35443:
--
Affects Version/s: (was: 3.1.1)
   3.2.0

> Mark K8s secrets and config maps as immutable
> -
>
> Key: SPARK-35443
> URL: https://issues.apache.org/jira/browse/SPARK-35443
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Ashray Jain
>Assignee: Ashray Jain
>Priority: Minor
>  Labels: kubernetes
> Fix For: 3.2.0
>
>
> Kubernetes supports marking secrets and config maps as immutable to gain 
> performance. 
> [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable]
> [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable]
> For K8s clusters that run many thousands of Spark applications, this can 
> yield significant reduction in load on the kube-apiserver.
> From the K8s docs:
> {quote}For clusters that extensively use Secrets (at least tens of thousands 
> of unique Secret to Pod mounts), preventing changes to their data has the 
> following advantages:
>  * protects you from accidental (or unwanted) updates that could cause 
> applications outages
>  * improves performance of your cluster by significantly reducing load on 
> kube-apiserver, by closing watches for secrets marked as immutable.{quote}
>  
> For any secrets and config maps we create in Spark that are immutable, we 
> could mark them as immutable by including the following when building the 
> secret/config map
> {code:java}
> .withImmutable(true)
> {code}
> This feature has been supported in K8s as beta since K8s 1.19 and as GA since 
> K8s 1.21



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2021-05-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27991:
---

Assignee: wuyi

> ShuffleBlockFetcherIterator should take Netty constant-factor overheads into 
> account when limiting number of simultaneous block fetches
> ---
>
> Key: SPARK-27991
> URL: https://issues.apache.org/jira/browse/SPARK-27991
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: wuyi
>Priority: Major
> Fix For: 3.2.0
>
>
> ShuffleBlockFetcherIterator has logic to limit the number of simultaneous 
> block fetches. By default, this logic tries to keep the number of outstanding 
> block fetches [beneath a data size 
> limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
>  ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
> into account: even though a remote block might be, say, 4KB, there are 
> certain fixed-size internal overheads due to Netty buffer sizes which may 
> cause the actual space requirements to be larger.
> As a result, if a map stage produces a huge number of extremely tiny blocks 
> then we may see errors like
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
> byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
> [...]
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
> 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
> at 
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
> at 
> io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
> at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
> [...]{code}
> SPARK-24989 is another report of this problem (but with a different proposed 
> fix).
> This problem can currently be mitigated by setting 
> {{spark.reducer.maxReqsInFlight}} to some non-IntMax value (SPARK-6166), 
> but this additional manual configuration step is cumbersome.
> Instead, I think that Spark should take these fixed overheads into account in 
> the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, 
> use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some 
> tricky details involved to make this work on all configurations (e.g. to use 
> a different minimum when direct buffers are disabled, etc.), but I think the 
> core idea behind the fix is pretty simple.
> This will improve Spark's stability and remove configuration / tuning burden 
> from end users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-05-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35106:
---

Assignee: (was: Erik Krogen)

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   # BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   # BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   # BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-05-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35106:
---

Assignee: Yuzhou Sun

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Yuzhou Sun
>Priority: Major
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   # BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   # BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   # BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35456) Show invalid value in config entry check error message

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35456:


Assignee: (was: Apache Spark)

> Show invalid value in config entry check error message
> --
>
> Key: SPARK-35456
> URL: https://issues.apache.org/jira/browse/SPARK-35456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35456) Show invalid value in config entry check error message

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348060#comment-17348060
 ] 

Apache Spark commented on SPARK-35456:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32600

> Show invalid value in config entry check error message
> --
>
> Key: SPARK-35456
> URL: https://issues.apache.org/jira/browse/SPARK-35456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35456) Show invalid value in config entry check error message

2021-05-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35456:


Assignee: Apache Spark

> Show invalid value in config entry check error message
> --
>
> Key: SPARK-35456
> URL: https://issues.apache.org/jira/browse/SPARK-35456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35130) Construct day-time interval column from integral fields

2021-05-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348064#comment-17348064
 ] 

Apache Spark commented on SPARK-35130:
--

User 'copperybean' has created a pull request for this issue:
https://github.com/apache/spark/pull/32601

> Construct day-time interval column from integral fields
> ---
>
> Key: SPARK-35130
> URL: https://issues.apache.org/jira/browse/SPARK-35130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Create a new function similar to make_interval() (or extend the make_interval() 
> function) that can construct DayTimeIntervalType values from day, hour, 
> minute, and second fields.
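
For reference, a minimal example of the existing make_interval() function that the proposal builds on (this produces a CalendarIntervalType value; the new function or extension described above would target DayTimeIntervalType):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# make_interval(years, months, weeks, days, hours, mins, secs)
spark.sql(
    "SELECT make_interval(0, 0, 0, 1, 2, 3, 4.5) AS i"
).show(truncate=False)
# Shows an interval of 1 day, 2 hours, 3 minutes and 4.5 seconds.
{code}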



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


