[jira] [Created] (SPARK-35441) InMemoryFileIndex loads all files into memory
Zhang Jianguo created SPARK-35441: - Summary: InMemoryFileIndex loads all files into memory Key: SPARK-35441 URL: https://issues.apache.org/jira/browse/SPARK-35441 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Zhang Jianguo When InMemoryFileIndex is initialized, it loads all files under ${rootPath}. If there are thousands of files, this can lead to an OOM on the driver. https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
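The eager listing the reporter points at can be illustrated with a small standalone sketch. `listAllFiles` below is hypothetical and much simpler than the real InMemoryFileIndex (which lists leaf files through Hadoop's FileSystem API); the point it shares with the real code is that every file status under the root is materialized into one in-memory collection, so driver memory grows linearly with the number of files:

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable.ArrayBuffer
import scala.jdk.CollectionConverters._

// Hypothetical sketch of an eager file index: every file under `root`
// ends up in a single in-memory buffer before any query runs.
def listAllFiles(root: Path): Seq[Path] = {
  val out = ArrayBuffer.empty[Path]
  def walk(dir: Path): Unit = {
    val children = Files.list(dir).iterator().asScala.toList
    children.foreach { p =>
      if (Files.isDirectory(p)) walk(p) else out += p
    }
  }
  walk(root)
  out.toSeq
}
```

With millions of leaf files, the buffer itself (plus the per-file metadata the real index keeps) is what exhausts the driver heap.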
[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35106: --- Assignee: Erik Krogen > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. 
the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. > We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
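The Block 1 / Block 2 interaction described above is easy to reproduce outside Hadoop with plain `java.nio.file` as a stand-in for `fs.delete`/`fs.rename` (on HDFS, `fs.rename` returns false in this situation; `java.nio` throws, which is exactly how the stricter FileSystem implementation surfaced the bug):

```scala
import java.io.IOException
import java.nio.file.Files

// Stand-in for Block 1 + Block 2: delete the destination's parent
// directory, then try to rename a file into it.
val work   = Files.createTempDirectory("commit")
val src    = Files.createFile(work.resolve("part-00000"))
val dstDir = Files.createDirectory(work.resolve("dt=2021-05-17"))
val dst    = dstDir.resolve("part-00000")

Files.delete(dstDir)                    // Block 1: partition dir cleanup

val renamed =                           // Block 2: rename cannot succeed,
  try { Files.move(src, dst); true }    // the parent of dst is gone
  catch { case _: IOException => false }
```

The source file is left in place and `renamed` is false, mirroring the silently ignored rename failures the reporter describes.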
[jira] [Resolved] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35106. - Fix Version/s: 3.0.3 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 32530 [https://github.com/apache/spark/pull/32530] > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. 
When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. > We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35368) [SQL] Update histogram statistics for RANGE operator stats estimation
[ https://issues.apache.org/jira/browse/SPARK-35368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35368. -- Fix Version/s: 3.2.0 Assignee: shahid Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32498 > [SQL]Update histogram statistics for RANGE operator stats estimation > > > Key: SPARK-35368 > URL: https://issues.apache.org/jira/browse/SPARK-35368 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.1.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35441) InMemoryFileIndex loads all files into memory
[ https://issues.apache.org/jira/browse/SPARK-35441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347404#comment-17347404 ] Yuming Wang commented on SPARK-35441: - What is your driver memory? > InMemoryFileIndex loads all files into memory > > > Key: SPARK-35441 > URL: https://issues.apache.org/jira/browse/SPARK-35441 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zhang Jianguo >Priority: Minor > > When InMemoryFileIndex is initialized, it loads all files under ${rootPath}. > If there are thousands of files, this can lead to an OOM on the driver. > > https://github.com/apache/spark/blob/0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L66
[jira] [Created] (SPARK-35442) Eliminate unnecessary join through Aggregate
XiDuo You created SPARK-35442: - Summary: Eliminate unnecessary join through Aggregate Key: SPARK-35442 URL: https://issues.apache.org/jira/browse/SPARK-35442 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: XiDuo You If Aggregate and Join have the same output partitioning, the plan will look like: {code:java} SortMergeJoin Sort HashAggregate Shuffle Sort xxx{code} To improve `EliminateUnnecessaryJoin`, we can support this case.
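The property this optimization relies on can be shown with plain Scala collections (an illustration of why a non-empty grouping key matters, not Spark's rule itself): after an aggregate, each grouping key occurs exactly once in the output, so an equi-join back on that key matches at most one aggregate row per key.

```scala
// groupBy guarantees each key appears once in the aggregate's output,
// which is the distinctness property the proposed rule depends on.
val rows       = Seq(("a", 1), ("a", 2), ("b", 3))
val aggregated = rows.groupBy { case (k, _) => k }
                     .map { case (k, vs) => (k, vs.map(_._2).sum) }
val keys       = aggregated.keys.toSeq   // unique by construction
```

With an empty grouping expression the aggregate collapses to a single global row and this reasoning no longer holds, which is why the ticket restricts the rewrite to non-empty grouping keys.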
[jira] [Updated] (SPARK-35442) Eliminate unnecessary join through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-35442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-35442: -- Description: If Aggregate and Join have the same output partitioning, the plan will look like: {code:java} SortMergeJoin Sort HashAggregate Shuffle Sort xxx{code} To improve `EliminateUnnecessaryJoin`, we can support this case. was: If Aggregate and Join has the same output partitioning, the plan will look like {code:java} SortMergeJoin Sort HashAggregate Shuffle Sort xxx{code} To imporve the `EliminateUnnecessaryJoin`, we can support this case. > Eliminate unnecessary join through Aggregate > > > Key: SPARK-35442 > URL: https://issues.apache.org/jira/browse/SPARK-35442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Minor > > If Aggregate and Join have the same output partitioning, the plan will look > like: > {code:java} > SortMergeJoin >Sort > HashAggregate >Shuffle >Sort > xxx{code} > > To improve `EliminateUnnecessaryJoin`, we can support this case. >
[jira] [Updated] (SPARK-35442) Eliminate unnecessary join through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-35442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-35442: -- Description: If Aggregate and Join have the same output partitioning, the plan will look like: {code:java} SortMergeJoin Sort HashAggregate Shuffle Sort xxx{code} Currently `EliminateUnnecessaryJoin` doesn't optimize this case. Logically, if the Aggregate grouping expression is not empty, we can safely eliminate it. was: If Aggregate and Join have the same output partitioning, the plan will look like: {code:java} SortMergeJoin Sort HashAggregate Shuffle Sort xxx{code} To imporve the `EliminateUnnecessaryJoin`, we can support this case. > Eliminate unnecessary join through Aggregate > > > Key: SPARK-35442 > URL: https://issues.apache.org/jira/browse/SPARK-35442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Minor > > If Aggregate and Join have the same output partitioning, the plan will look > like: > {code:java} > SortMergeJoin >Sort > HashAggregate >Shuffle >Sort > xxx{code} > > Currently `EliminateUnnecessaryJoin` doesn't optimize this case. > Logically, if the Aggregate grouping expression is not empty, we can safely > eliminate it. >
[jira] [Updated] (SPARK-35442) Eliminate unnecessary join through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-35442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-35442: -- Description: If Aggregate and Join have the same output partitioning, the plan will look like: {code:java} SortMergeJoin Sort HashAggregate Shuffle Sort xxx{code} Currently `EliminateUnnecessaryJoin` doesn't optimize this case. Logically, if the Aggregate grouping expression is not empty, we can eliminate it safely. was: If Aggregate and Join have the same output partitioning, the plan will look like: {code:java} SortMergeJoin Sort HashAggregate Shuffle Sort xxx{code} Currently `EliminateUnnecessaryJoin` doesn't support optimize this case. Logically, if the Aggregate grouping expression is not empty, we can safe eliminate it. > Eliminate unnecessary join through Aggregate > > > Key: SPARK-35442 > URL: https://issues.apache.org/jira/browse/SPARK-35442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Minor > > If Aggregate and Join have the same output partitioning, the plan will look > like: > {code:java} > SortMergeJoin >Sort > HashAggregate >Shuffle >Sort > xxx{code} > > Currently `EliminateUnnecessaryJoin` doesn't optimize this case. > Logically, if the Aggregate grouping expression is not empty, we can > eliminate it safely. >
[jira] [Commented] (SPARK-26284) Spark History server object vs file storage behavior difference
[ https://issues.apache.org/jira/browse/SPARK-26284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347496#comment-17347496 ] Steve Loughran commented on SPARK-26284: # s3:// URLs mean you are using EMR? If so: take it up with them. Or, if you are using a very old version of hadoop, move up to a newer build with s3a. It still won't fix this problem, but you will be in supported code. Now, please look at the comment above yours, and reread, especially the bit where I explain why this is WONTFIX. thx > Spark History server object vs file storage behavior difference > --- > > Key: SPARK-26284 > URL: https://issues.apache.org/jira/browse/SPARK-26284 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Damien Doucet-Girard >Priority: Minor > > I am using the spark history server in order to view running/complete jobs on > spark using the kubernetes scheduling backend introduced in 2.3.0. Using a > local file path in both {color:#33}{{spark.eventLog.dir}}{color} and > {{spark.history.fs.logDirectory}}, I have no issue seeing both incomplete and > completed tasks, with {{.inprogress}} files being flushed regularly. However, > when using an {{s3a://}} path, it seems the calls to flush the file > ([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154)] > don't actually upload the file to s3. Due to this, I am unable to see > currently incomplete tasks using an s3a path. > From the behavior I've observed, it only uploads on completion of the task > (hadoop 2.7) or upon the log file filling up the block size set for s3a > {{spark.hadoop.fs.s3a.multipart.size}} (hadoop 3.0.0). Is this intended > behavior? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35418) Add sentences function to functions.{scala,py}
[ https://issues.apache.org/jira/browse/SPARK-35418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-35418. Fix Version/s: 3.2.0 Resolution: Fixed Resolved in https://github.com/apache/spark/pull/32566. > Add sentences function to functions.{scala,py} > -- > > Key: SPARK-35418 > URL: https://issues.apache.org/jira/browse/SPARK-35418 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > Spark SQL provides sentences function but it can be used only from SQL. > It's good if we can use it from Scala and Python code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
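For context, this is what the SQL-only function computes today (the ticket adds Scala/Python wrappers around the same expression; output shape shown approximately, exact formatting depends on the client):

```sql
-- sentences(str[, language, country]) splits text into sentences,
-- each tokenized into an array of words
SELECT sentences('Hi there! Good morning.');
-- roughly: [[Hi, there], [Good, morning]]
```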
[jira] [Created] (SPARK-35443) Mark K8s secrets and config maps as immutable
Ashray Jain created SPARK-35443: --- Summary: Mark K8s secrets and config maps as immutable Key: SPARK-35443 URL: https://issues.apache.org/jira/browse/SPARK-35443 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.1 Reporter: Ashray Jain Kubernetes supports marking secrets and config maps as immutable to gain performance. [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] For K8s clusters that run many thousands of Spark applications, this can yield a significant reduction in load on the kube-apiserver. From the K8s docs: {quote}For clusters that extensively use Secrets (at least tens of thousands of unique Secret to Pod mounts), preventing changes to their data has the following advantages: * protects you from accidental (or unwanted) updates that could cause applications outages * improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for secrets marked as immutable.{quote} Any secrets and config maps Spark creates that are never updated after creation could be marked as immutable by including the following when building the secret/config map: {code:java} .withImmutable(true) {code} This feature has been supported in K8s as beta since K8s 1.19 and as GA since K8s 1.21.
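As a sketch, the change amounts to one extra top-level field on the generated manifests. The manifest below is hypothetical (Spark builds these objects programmatically through the Kubernetes client rather than from YAML, and the name is invented), but it shows the field the Kubernetes docs describe:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-drv-conf-map   # hypothetical name
immutable: true              # once set, data can no longer be updated
data:
  spark.properties: |
    spark.executor.instances=2
```

Because immutable objects can never change, the kubelet can close its watches on them, which is where the kube-apiserver load reduction comes from.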
[jira] [Assigned] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35443: Assignee: Apache Spark > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Ashray Jain >Assignee: Apache Spark >Priority: Minor > Labels: kubernetes > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347599#comment-17347599 ] Apache Spark commented on SPARK-35443: -- User 'ashrayjain' has created a pull request for this issue: https://github.com/apache/spark/pull/32588 > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Ashray Jain >Priority: Minor > Labels: kubernetes > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35443: Assignee: (was: Apache Spark) > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Ashray Jain >Priority: Minor > Labels: kubernetes > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347600#comment-17347600 ] Apache Spark commented on SPARK-35443: -- User 'ashrayjain' has created a pull request for this issue: https://github.com/apache/spark/pull/32588 > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Ashray Jain >Priority: Minor > Labels: kubernetes > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35444) Improve createTable logic if table exists
yikf created SPARK-35444: Summary: Improve createTable logic if table exists Key: SPARK-35444 URL: https://issues.apache.org/jira/browse/SPARK-35444 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: yikf Fix For: 3.0.0 It should return directly if the table exists and ignoreIfExists=true.
[jira] [Updated] (SPARK-35444) Improve createTable logic if table exists
[ https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikf updated SPARK-35444: - Description: It should return directly if the table exists and ignoreIfExists=true, instead of still calling externalCatalog.createTable. The current logic: {code:java} requireDbExists(db) if (tableExists(newTableDefinition.identifier)) { if (!ignoreIfExists) { throw new TableAlreadyExistsException(db = db, table = table) } } else if (validateLocation) { validateTableLocation(newTableDefinition) } externalCatalog.createTable(newTableDefinition, ignoreIfExists) {code} was:It should return directly if table exists and ignoreIfExists=true > Improve createTable logic if table exists > - > > Key: SPARK-35444 > URL: https://issues.apache.org/jira/browse/SPARK-35444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: yikf >Priority: Major > Fix For: 3.0.0 > > > It should return directly if the table exists and ignoreIfExists=true, > instead of still calling externalCatalog.createTable. > > The current logic: > {code:java} > requireDbExists(db) > if (tableExists(newTableDefinition.identifier)) { > if (!ignoreIfExists) { > throw new TableAlreadyExistsException(db = db, table = table) > } > } else if (validateLocation) { > validateTableLocation(newTableDefinition) > } > externalCatalog.createTable(newTableDefinition, ignoreIfExists) > {code}
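The proposed short-circuit can be sketched against a toy in-memory catalog. `ToyCatalog`, `TableAlreadyExists`, and the `externalCalls` counter are all invented for illustration; the real logic quoted above lives in the session catalog and delegates to `externalCatalog.createTable`:

```scala
import scala.collection.mutable

final case class TableAlreadyExists(table: String) extends Exception(table)

// Toy stand-in for the catalog. The point is the early return: when the
// table already exists and ignoreIfExists = true, we skip the (potentially
// expensive) external-catalog call entirely.
class ToyCatalog {
  private val tables = mutable.Map.empty[String, String]
  var externalCalls  = 0   // would-be calls to externalCatalog.createTable

  def createTable(name: String, defn: String, ignoreIfExists: Boolean): Unit = {
    if (tables.contains(name)) {
      if (!ignoreIfExists) throw TableAlreadyExists(name)
      return               // proposed: return directly, skip the external call
    }
    externalCalls += 1
    tables(name) = defn
  }
}
```

In the current code the external call happens even when the table exists and `ignoreIfExists` is true, relying on the external catalog to ignore it; the early return avoids that round trip.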
[jira] [Commented] (SPARK-35444) Improve createTable logic if table exists
[ https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347619#comment-17347619 ] Apache Spark commented on SPARK-35444: -- User 'yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/32589 > Improve createTable logic if table exists > - > > Key: SPARK-35444 > URL: https://issues.apache.org/jira/browse/SPARK-35444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: yikf >Priority: Major > Fix For: 3.0.0 > > > It should return directly if table exists and ignoreIfExists=true > > current logic: > {code:java} > requireDbExists(db) > if (tableExists(newTableDefinition.identifier)) { > if (!ignoreIfExists) { > throw new TableAlreadyExistsException(db = db, table = table) > } > } else if (validateLocation) { > validateTableLocation(newTableDefinition) > } > externalCatalog.createTable(newTableDefinition, ignoreIfExists) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35444) Improve createTable logic if table exists
[ https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35444: Assignee: Apache Spark > Improve createTable logic if table exists > - > > Key: SPARK-35444 > URL: https://issues.apache.org/jira/browse/SPARK-35444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: yikf >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > > It should return directly if table exists and ignoreIfExists=true > > current logic: > {code:java} > requireDbExists(db) > if (tableExists(newTableDefinition.identifier)) { > if (!ignoreIfExists) { > throw new TableAlreadyExistsException(db = db, table = table) > } > } else if (validateLocation) { > validateTableLocation(newTableDefinition) > } > externalCatalog.createTable(newTableDefinition, ignoreIfExists) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35444) Improve createTable logic if table exists
[ https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35444: Assignee: (was: Apache Spark) > Improve createTable logic if table exists > - > > Key: SPARK-35444 > URL: https://issues.apache.org/jira/browse/SPARK-35444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: yikf >Priority: Major > Fix For: 3.0.0 > > > It should return directly if table exists and ignoreIfExists=true > > current logic: > {code:java} > requireDbExists(db) > if (tableExists(newTableDefinition.identifier)) { > if (!ignoreIfExists) { > throw new TableAlreadyExistsException(db = db, table = table) > } > } else if (validateLocation) { > validateTableLocation(newTableDefinition) > } > externalCatalog.createTable(newTableDefinition, ignoreIfExists) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35444) Improve createTable logic if table exists
[ https://issues.apache.org/jira/browse/SPARK-35444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347620#comment-17347620 ] Apache Spark commented on SPARK-35444: -- User 'yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/32589 > Improve createTable logic if table exists > - > > Key: SPARK-35444 > URL: https://issues.apache.org/jira/browse/SPARK-35444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: yikf >Priority: Major > Fix For: 3.0.0 > > > It should return directly if table exists and ignoreIfExists=true > > current logic: > {code:java} > requireDbExists(db) > if (tableExists(newTableDefinition.identifier)) { > if (!ignoreIfExists) { > throw new TableAlreadyExistsException(db = db, table = table) > } > } else if (validateLocation) { > validateTableLocation(newTableDefinition) > } > externalCatalog.createTable(newTableDefinition, ignoreIfExists) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
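[Editor's note] The early return proposed in SPARK-35444 can be sketched as follows. This is a toy model with stand-in types, not Spark's actual SessionCatalog code; the counter stands in for the external-catalog RPC that the early return avoids.

```scala
// Hypothetical sketch of the SPARK-35444 early return; TableId and the
// mutable set are stand-ins, not Spark's SessionCatalog internals.
object CreateTableSketch {
  final case class TableId(db: String, table: String)

  private val existing = scala.collection.mutable.Set(TableId("default", "t1"))
  var createCalls = 0 // counts how often the (expensive) external-catalog call runs

  def createTable(id: TableId, ignoreIfExists: Boolean, validateLocation: Boolean): Unit = {
    if (existing.contains(id)) {
      if (!ignoreIfExists) {
        throw new IllegalStateException(s"Table ${id.db}.${id.table} already exists")
      }
      // Early return: the table exists and we were asked to ignore that,
      // so there is no need to call into the external catalog at all.
      return
    }
    if (validateLocation) {
      // validateTableLocation(newTableDefinition) would run here
    }
    createCalls += 1 // stands in for externalCatalog.createTable(...)
    existing += id
  }
}
```

With this shape, creating an already-existing table with ignoreIfExists=true is a no-op, rather than falling through to externalCatalog.createTable as the current code does.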
[jira] [Resolved] (SPARK-35362) Update null count in the column stats for UNION stats estimation
[ https://issues.apache.org/jira/browse/SPARK-35362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35362. -- Fix Version/s: 3.2.0 Assignee: shahid Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32494 > Update null count in the column stats for UNION stats estimation > > > Key: SPARK-35362 > URL: https://issues.apache.org/jira/browse/SPARK-35362 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.0.2 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35093) AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-35093. --- Fix Version/s: 3.2.0 3.1.2 3.0.3 Assignee: Andy Grove Resolution: Fixed > AQE columnar mismatch on exchange reuse > --- > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
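[Editor's note] The mismatch described in SPARK-35093 can be illustrated with a small sketch. This is not AdaptiveSparkPlanExec's real code; Exchange here is a stand-in type, and the point is only that the reuse key must include the supportsColumnar flag in addition to the canonicalized plan.

```scala
// Sketch of exchange reuse keyed by (canonical plan, supportsColumnar).
// Two exchanges with identical canonical forms can still be incompatible
// if a plugin made one columnar and the other row-based.
object ExchangeReuseSketch {
  final case class Exchange(canonicalized: String, supportsColumnar: Boolean)

  // Reuse map keyed by the canonical form AND the columnar flag.
  private val reuseMap =
    scala.collection.mutable.Map[(String, Boolean), Exchange]()

  // Return an already-registered compatible exchange, or register this one.
  def reuseOrRegister(e: Exchange): Exchange =
    reuseMap.getOrElseUpdate((e.canonicalized, e.supportsColumnar), e)
}
```

Keying only on the canonicalized plan, as the pre-fix code effectively did, would have returned the columnar exchange for the row-based one.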
[jira] [Created] (SPARK-35445) Reduce the execution time of DeduplicateRelations
wuyi created SPARK-35445: Summary: Reduce the execution time of DeduplicateRelations Key: SPARK-35445 URL: https://issues.apache.org/jira/browse/SPARK-35445 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: wuyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-35373: - Fix Version/s: 3.1.2 3.0.3 > Verify checksums of downloaded artifacts in build/mvn > - > > Key: SPARK-35373 > URL: https://issues.apache.org/jira/browse/SPARK-35373 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Sean R. Owen >Assignee: Apache Spark >Priority: Minor > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > build/mvn is a convenience script that will automatically download Maven (and > Scala) if not already present. While it downloads from official ASF mirrors, > it does not check the checksum of the artifact, which is available as a > .sha512 file from ASF servers. > The risk of a supply chain attack is a bit less theoretical here than usual, > because artifacts are downloaded from any of several mirrors worldwide, and > injecting a malicious copy of Maven in any one of them might be simpler and > less noticeable than injecting it into ASF servers. > (Note, Scala's download site does not seem to provide a checksum. They do all > come from Lightbend, at least, not N mirrors. Not much we can do there.) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
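[Editor's note] The check described in SPARK-35373 amounts to hashing the downloaded artifact and comparing against the .sha512 file from ASF servers. build/mvn itself is a shell script; the Scala sketch below only illustrates the comparison, using the JDK's built-in SHA-512 digest.

```scala
import java.security.MessageDigest

// Sketch of the verification step: compute the SHA-512 of the downloaded
// bytes and compare it to the published hex digest. Any mismatch should
// abort the build rather than run an unverified Maven.
object ChecksumSketch {
  def sha512Hex(data: Array[Byte]): String =
    MessageDigest.getInstance("SHA-512").digest(data).map("%02x".format(_)).mkString

  def verify(artifact: Array[Byte], expectedHex: String): Boolean =
    sha512Hex(artifact) == expectedHex.trim.toLowerCase
}
```

In the real script the expected digest comes from the corresponding .sha512 file on ASF servers, which is fetched over a trusted channel rather than from the mirror that served the artifact.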
[jira] [Assigned] (SPARK-35445) Reduce the execution time of DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35445: Assignee: Apache Spark > Reduce the execution time of DeduplicateRelations > - > > Key: SPARK-35445 > URL: https://issues.apache.org/jira/browse/SPARK-35445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35445) Reduce the execution time of DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35445: Assignee: (was: Apache Spark) > Reduce the execution time of DeduplicateRelations > - > > Key: SPARK-35445 > URL: https://issues.apache.org/jira/browse/SPARK-35445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35445) Reduce the execution time of DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347706#comment-17347706 ] Apache Spark commented on SPARK-35445: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/32590 > Reduce the execution time of DeduplicateRelations > - > > Key: SPARK-35445 > URL: https://issues.apache.org/jira/browse/SPARK-35445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35445) Reduce the execution time of DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-35445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347704#comment-17347704 ] Apache Spark commented on SPARK-35445: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/32590 > Reduce the execution time of DeduplicateRelations > - > > Key: SPARK-35445 > URL: https://issues.apache.org/jira/browse/SPARK-35445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35446) Override getJDBCType in MySQLDialect to map FloatType to FLOAT
Marios Meimaris created SPARK-35446: --- Summary: Override getJDBCType in MySQLDialect to map FloatType to FLOAT Key: SPARK-35446 URL: https://issues.apache.org/jira/browse/SPARK-35446 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1 Reporter: Marios Meimaris In [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L165], FloatType is mapped to REAL. However, MySQL treats REAL as a synonym for DOUBLE by default (see [https://dev.mysql.com/doc/refman/8.0/en/numeric-types.html] for more information). `MySQLDialect` could override `getJDBCType` so that it maps FloatType to FLOAT instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
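[Editor's note] The proposed override can be sketched as below. Spark's real hook is JdbcDialect.getJDBCType(dt: DataType): Option[JdbcType]; the sketch uses stand-in DataType and JdbcType definitions so it runs without a Spark dependency, but the shape of the override is the same.

```scala
// Sketch of the SPARK-35446 proposal: MySQLDialect maps FloatType to FLOAT
// instead of inheriting the generic REAL mapping (which MySQL reads as DOUBLE).
// DataType and JdbcType are stand-ins for the Spark classes of the same names.
object MySQLTypeMapping {
  sealed trait DataType
  case object FloatType extends DataType
  case object DoubleType extends DataType

  final case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)

  // Generic mapping as in JdbcUtils: FloatType -> REAL.
  def defaultGetJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case FloatType  => Some(JdbcType("REAL", java.sql.Types.FLOAT))
    case DoubleType => Some(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
  }

  // MySQL override: FLOAT is MySQL's 4-byte single-precision type.
  def mysqlGetJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case FloatType => Some(JdbcType("FLOAT", java.sql.Types.FLOAT))
    case other     => defaultGetJDBCType(other)
  }
}
```

Without the override, a Spark FloatType column written through JDBC to MySQL is declared as REAL and therefore stored as an 8-byte DOUBLE.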
[jira] [Commented] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347765#comment-17347765 ] Apache Spark commented on SPARK-35373: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32591 > Verify checksums of downloaded artifacts in build/mvn > - > > Key: SPARK-35373 > URL: https://issues.apache.org/jira/browse/SPARK-35373 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Sean R. Owen >Assignee: Apache Spark >Priority: Minor > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > build/mvn is a convenience script that will automatically download Maven (and > Scala) if not already present. While it downloads from official ASF mirrors, > it does not check the checksum of the artifact, which is available as a > .sha512 file from ASF servers. > The risk of a supply chain attack is a bit less theoretical here than usual, > because artifacts are downloaded from any of several mirrors worldwide, and > injecting a malicious copy of Maven in any one of them might be simpler and > less noticeable than injecting it into ASF servers. > (Note, Scala's download site does not seem to provide a checksum. They do all > come from Lightbend, at least, not N mirrors. Not much we can do there.) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35343) Make conversion from/to pandas data-type-based
[ https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35343: Assignee: Apache Spark > Make conversion from/to pandas data-type-based > -- > > Key: SPARK-35343 > URL: https://issues.apache.org/jira/browse/SPARK-35343 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35343) Make conversion from/to pandas data-type-based
[ https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347808#comment-17347808 ] Apache Spark commented on SPARK-35343: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/32592 > Make conversion from/to pandas data-type-based > -- > > Key: SPARK-35343 > URL: https://issues.apache.org/jira/browse/SPARK-35343 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35343) Make conversion from/to pandas data-type-based
[ https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347809#comment-17347809 ] Apache Spark commented on SPARK-35343: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/32592 > Make conversion from/to pandas data-type-based > -- > > Key: SPARK-35343 > URL: https://issues.apache.org/jira/browse/SPARK-35343 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35343) Make conversion from/to pandas data-type-based
[ https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35343: Assignee: (was: Apache Spark) > Make conversion from/to pandas data-type-based > -- > > Key: SPARK-35343 > URL: https://issues.apache.org/jira/browse/SPARK-35343 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35447) optimize skew join before coalescing shuffle partitions
Wenchen Fan created SPARK-35447: --- Summary: optimize skew join before coalescing shuffle partitions Key: SPARK-35447 URL: https://issues.apache.org/jira/browse/SPARK-35447 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347817#comment-17347817 ] Andy Grove commented on SPARK-32304: [~dongjoon] This is the issue we just saw in branch-3.0. > Flaky Test: AdaptiveQueryExecSuite.multiple joins > - > > Key: SPARK-32304 > URL: https://issues.apache.org/jira/browse/SPARK-32304 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > AdaptiveQueryExecSuite: > - Change merge join to broadcast join (313 milliseconds) > - Reuse the parallelism of CoalescedShuffleReaderExec in > LocalShuffleReaderExec (265 milliseconds) > - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds) > - Empty stage coalesced to 0-partition RDD (514 milliseconds) > - Scalar subquery (406 milliseconds) > - Scalar subquery in later stages (500 milliseconds) > - multiple joins *** FAILED *** (739 milliseconds) > ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft > :- BroadcastQueryStage 5 > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, > false] as bigint))), [id=#504817] > : +- CustomShuffleReader local > :+- ShuffleQueryStage 4 > : +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777] > : +- *(7) SortMergeJoin [key#251418], [a#251428], Inner > : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0 > : : +- CustomShuffleReader coalesced > : : +- ShuffleQueryStage 0 > : :+- Exchange hashpartitioning(key#251418, 5), > true, [id=#504656] > : : +- *(1) Filter (isnotnull(value#251419) AND > (cast(value#251419 as int) = 1)) > : : +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > 
org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) > AS value#251419] > : : +- Scan[obj#251417] > : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0 > :+- CustomShuffleReader coalesced > : +- ShuffleQueryStage 1 > : +- Exchange hashpartitioning(a#251428, 5), true, > [id=#504663] > : +- *(2) Filter (b#251429 = 1) > :+- *(2) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, > knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429] > : +- Scan[obj#251427] > +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) > AS l#251499] > : +- Scan[obj#251497] > +- BroadcastQueryStage 6 > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, > int, false] as bigint))), [id=#504830] >+- CustomShuffleReader local > +- ShuffleQueryStage 3 > +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694] > +- *(4) Filter (a#251438 = 1) >+- *(4) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, > unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439] > +- Scan[obj#251437] > , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), 
true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$Lowe
[jira] [Assigned] (SPARK-35447) optimize skew join before coalescing shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35447: Assignee: (was: Apache Spark) > optimize skew join before coalescing shuffle partitions > --- > > Key: SPARK-35447 > URL: https://issues.apache.org/jira/browse/SPARK-35447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35447) optimize skew join before coalescing shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347819#comment-17347819 ] Apache Spark commented on SPARK-35447: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32594 > optimize skew join before coalescing shuffle partitions > --- > > Key: SPARK-35447 > URL: https://issues.apache.org/jira/browse/SPARK-35447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35447) optimize skew join before coalescing shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35447: Assignee: Apache Spark > optimize skew join before coalescing shuffle partitions > --- > > Key: SPARK-35447 > URL: https://issues.apache.org/jira/browse/SPARK-35447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35447) optimize skew join before coalescing shuffle partitions
[ https://issues.apache.org/jira/browse/SPARK-35447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347820#comment-17347820 ] Apache Spark commented on SPARK-35447: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32594 > optimize skew join before coalescing shuffle partitions > --- > > Key: SPARK-35447 > URL: https://issues.apache.org/jira/browse/SPARK-35447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35448) Subexpression elimination enhancements
L. C. Hsieh created SPARK-35448: --- Summary: Subexpression elimination enhancements Key: SPARK-35448 URL: https://issues.apache.org/jira/browse/SPARK-35448 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.2.0 Reporter: L. C. Hsieh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35439: Parent: SPARK-35448 Issue Type: Sub-task (was: Improvement) > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > EquivalentExpressions maintains a map of equivalent expressions. It is a > HashMap now, so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there are child-parent relationships among the subexpressions, we want the > child expressions to come before the parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: the child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: the parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
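[Editor's note] The ordering property that SPARK-35439 relies on can be demonstrated with a small sketch. This is not EquivalentExpressions itself; expressions are modeled as plain strings, and only case 1 (child registered before parent) is shown, which a LinkedHashMap handles by preserving insertion order. Case 2 would additionally need a child-before-parent sort.

```scala
// Sketch: LinkedHashMap iterates in insertion order, so a child
// subexpression registered first is also retrieved first, letting the
// parent's code reuse the child's evaluation. A plain HashMap gives no
// such guarantee. Keys stand in for expressions; values count occurrences.
object SubexprOrderSketch {
  import scala.collection.mutable

  val subexprs = mutable.LinkedHashMap[String, Int]()

  // Record one occurrence of an expression, preserving first-seen order.
  def addExpr(e: String): Unit =
    subexprs.update(e, subexprs.getOrElse(e, 0) + 1)
}
```

Iterating subexprs.keys after registering `add` and then `Add(Literal(3), add)` yields the child before the parent, which is the order subexpression elimination needs.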
[jira] [Updated] (SPARK-35410) Unused subexpressions leftover in WholeStageCodegen subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35410: Parent: SPARK-35448 Issue Type: Sub-task (was: Bug) > Unused subexpressions leftover in WholeStageCodegen subexpression elimination > - > > Key: SPARK-35410 > URL: https://issues.apache.org/jira/browse/SPARK-35410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Priority: Major > Attachments: codegen.txt > > > Trying to understand and debug the performance of some of our jobs, I started > digging into what the whole stage codegen code was doing. We use a lot of > case when statements, and I found that there were a lot of unused sub > expressions that were left over after the subexpression elimination, and it > gets worse the more expressions you have chained. The simple example: > {code:java} > import org.apache.spark.sql.functions._ > import spark.implicits._ > val myUdf = udf((s: String) => { > println("In UDF") > s.toUpperCase > }) > spark.range(5).select(when(length(myUdf($"id")) > 0, > length(myUdf($"id".show() > {code} > Running the code, you'll see "In UDF" printed out 10 times. And if you change > both to log(length(myUdf($"id")), "In UDF" will print out 20 times (one more > for a cast from int to double and one more for the actual log calculation I > think). 
> In the codegen for this (without the log), there are these initial > subexpressions: > {code:java} > /* 076 */ UTF8String project_subExprValue_0 = > project_subExpr_0(project_expr_0_0); > /* 077 */ int project_subExprValue_1 = > project_subExpr_1(project_expr_0_0); > /* 078 */ UTF8String project_subExprValue_2 = > project_subExpr_2(project_expr_0_0); > {code} > project_subExprValue_0 and project_subExprValue_2 are never actually used, so > it's properly resolving the two expressions and sharing the result of > project_subExprValue_1, but it's not removing the other sub expression calls > it seems like. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347842#comment-17347842 ] L. C. Hsieh commented on SPARK-35449: - cc [~Kimahriman] > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
L. C. Hsieh created SPARK-35449: --- Summary: Should not extract common expressions from value expressions when elseValue is empty in CaseWhen Key: SPARK-35449 URL: https://issues.apache.org/jira/browse/SPARK-35449 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: L. C. Hsieh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347871#comment-17347871 ] Dongjoon Hyun commented on SPARK-32304: --- Oh, got it. Thank you, [~andygrove]! > Flaky Test: AdaptiveQueryExecSuite.multiple joins > - > > Key: SPARK-32304 > URL: https://issues.apache.org/jira/browse/SPARK-32304 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > AdaptiveQueryExecSuite: > - Change merge join to broadcast join (313 milliseconds) > - Reuse the parallelism of CoalescedShuffleReaderExec in > LocalShuffleReaderExec (265 milliseconds) > - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds) > - Empty stage coalesced to 0-partition RDD (514 milliseconds) > - Scalar subquery (406 milliseconds) > - Scalar subquery in later stages (500 milliseconds) > - multiple joins *** FAILED *** (739 milliseconds) > ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft > :- BroadcastQueryStage 5 > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, > false] as bigint))), [id=#504817] > : +- CustomShuffleReader local > :+- ShuffleQueryStage 4 > : +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777] > : +- *(7) SortMergeJoin [key#251418], [a#251428], Inner > : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0 > : : +- CustomShuffleReader coalesced > : : +- ShuffleQueryStage 0 > : :+- Exchange hashpartitioning(key#251418, 5), > true, [id=#504656] > : : +- *(1) Filter (isnotnull(value#251419) AND > (cast(value#251419 as int) = 1)) > : : +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > 
org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) > AS value#251419] > : : +- Scan[obj#251417] > : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0 > :+- CustomShuffleReader coalesced > : +- ShuffleQueryStage 1 > : +- Exchange hashpartitioning(a#251428, 5), true, > [id=#504663] > : +- *(2) Filter (b#251429 = 1) > :+- *(2) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, > knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429] > : +- Scan[obj#251427] > +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) > AS l#251499] > : +- Scan[obj#251497] > +- BroadcastQueryStage 6 > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, > int, false] as bigint))), [id=#504830] >+- CustomShuffleReader local > +- ShuffleQueryStage 3 > +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694] > +- *(4) Filter (a#251438 = 1) >+- *(4) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, > unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439] > +- Scan[obj#251437] > , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), 
true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n
[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32304: -- Parent: SPARK-33828 Issue Type: Sub-task (was: Bug)
[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32304: -- Parent: (was: SPARK-32244) Issue Type: Bug (was: Sub-task)
[jira] [Reopened] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-32304: --- I reopen and convert to a subtask of SPARK-33828. If there is no more occurrence during the Apache Spark 3.2.0 QA period, we can close this.
[jira] [Resolved] (SPARK-35338) Separate arithmetic operations into data type based structures
[ https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-35338. --- Fix Version/s: 3.2.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 32469 https://github.com/apache/spark/pull/32469 > Separate arithmetic operations into data type based structures > -- > > Key: SPARK-35338 > URL: https://issues.apache.org/jira/browse/SPARK-35338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > For arithmetic operations: > {code:java} > __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__, > __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__ > {code} > we would like to separate them into data-type-based structures. > The existing behaviors of each arithmetic operation should be preserved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
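The "data-type-based structures" idea above can be illustrated in plain Python. This is a hypothetical sketch, not the actual pyspark.pandas code: the class and function names (`DataTypeOps`, `ops_for`, `Column`) are illustrative only. The point is that each arithmetic dunder method delegates to a per-data-type operations object, so each data type owns its own arithmetic rules while the existing operator behavior is preserved.

```python
# Hypothetical sketch (not the actual pyspark.pandas code): dispatch the
# arithmetic dunder methods to per-data-type operation classes.

class DataTypeOps:
    """Base class: one subclass per data type family."""
    def add(self, left, right):
        raise TypeError(f"+ not supported for {type(self).__name__}")

class NumericOps(DataTypeOps):
    def add(self, left, right):
        return left + right

class StringOps(DataTypeOps):
    def add(self, left, right):
        # String "addition" means concatenation.
        return left + right

def ops_for(value):
    # Pick the ops structure based on the operand's data type.
    if isinstance(value, (int, float)):
        return NumericOps()
    if isinstance(value, str):
        return StringOps()
    return DataTypeOps()

class Column:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # The dunder method only delegates; behavior stays type-driven.
        return ops_for(self.value).add(self.value, other.value)

print(Column(1) + Column(2))      # 3
print(Column("a") + Column("b"))  # ab
```

The same pattern would extend to `__sub__`, `__mul__`, and the reflected variants listed in the issue, each subclass overriding only the operations its data type supports.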
[jira] [Commented] (SPARK-35256) Subexpression elimination leading to a performance regression
[ https://issues.apache.org/jira/browse/SPARK-35256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347974#comment-17347974 ] Adam Binford commented on SPARK-35256: -- See https://issues.apache.org/jira/browse/SPARK-35448 for a few related issues on subexpression elimination being worked on. Specifically https://issues.apache.org/jira/browse/SPARK-35410 is probably the main issue you're facing, where `when` expressions can cause a lot of extra, unused subexpressions to be included in the codegen. > Subexpression elimination leading to a performance regression > - > > Key: SPARK-35256 > URL: https://issues.apache.org/jira/browse/SPARK-35256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Ondrej Kokes >Priority: Minor > Attachments: bisect_log.txt, bisect_timing.csv > > > I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline > that does mostly str_to_map, split and a few other operations - all > projections, no joins or aggregations (it's here only to trigger the > pipeline). I cut it down to the simplest reproducible example I could - > anything I remove from this changes the runtime difference quite > dramatically. 
(even moving those two expressions from f.when to standalone > columns makes the difference disappear) > {code:java} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=f,o,1:2:3\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .withColumn('extracted', > # without this top level split it is only 50% > slower, with it > # the runtime almost doubles > f.split(f.split(f.col("my_map")["bar"], ",")[2], > ":")[0] >) > .select( > f.when( > f.col("extracted").startswith("foo"), f.col("extracted") > ).otherwise( > f.concat(f.lit("foo"), f.col("extracted")) > ).alias("foo") > ) > ) > # dd.explain(True) > _ = dd.groupby("foo").count().count() > print("elapsed", time.time() - t) > {code} > Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my > local macOS) > {code:java} > 3.0.1 > elapsed 21.262351036071777 > 3.1.1 > elapsed 40.26582884788513 > {code} > (Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1) > Feel free to make the CSV smaller to get a quicker feedback loop - it scales > linearly (I developed this with 2M rows). > It might be related to my previous issue - SPARK-32989 - there are similar > operations, nesting etc. (splitting on the original column, not on a map, > makes the difference disappear) > I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and > 3.1.1 produced identical plans. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
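The regression discussed above can be illustrated without Spark. This is a plain-Python model, not Spark's codegen: it shows why hoisting a "common" subexpression out of conditional branches, as subexpression elimination may do, forces it to be computed even on rows where no branch needs it, which is one plausible source of the extra runtime.

```python
# Illustration only (plain Python, not Spark codegen): hoisting a common
# subexpression out of a conditional makes every row pay for it.

calls = {"n": 0}

def expensive(x):
    calls["n"] += 1
    return x * 10

def conditional_eval(x):
    # Branch-local evaluation: expensive() runs only when its branch is taken.
    if x > 0:
        return expensive(x) + 1
    return 0

def hoisted_eval(x):
    # Naive subexpression "elimination": compute once, up front, always.
    common = expensive(x)
    if x > 0:
        return common + 1
    return 0

rows = [-1, -2, -3, 4]

calls["n"] = 0
[conditional_eval(r) for r in rows]
print(calls["n"])  # 1  (only the row with x > 0 pays)

calls["n"] = 0
[hoisted_eval(r) for r in rows]
print(calls["n"])  # 4  (every row pays)
```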
[jira] [Commented] (SPARK-35448) Subexpression elimination enhancements
[ https://issues.apache.org/jira/browse/SPARK-35448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347976#comment-17347976 ] Adam Binford commented on SPARK-35448: -- Just spitballing after looking through some of the subexpression elimination code. Does it make more sense to treat the output of EquivalentExpressions as "candidates" for subexpressions, and then it's the codegen's job to figure out which subexpressions are actually needed and used and include the code for them? > Subexpression elimination enhancements > -- > > Key: SPARK-35448 > URL: https://issues.apache.org/jira/browse/SPARK-35448 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-35448) Subexpression elimination enhancements
[ https://issues.apache.org/jira/browse/SPARK-35448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347976#comment-17347976 ] Adam Binford edited comment on SPARK-35448 at 5/19/21, 11:21 PM: - Just spitballing after looking through some of the subexpression elimination code. Does it make any sense to treat the output of EquivalentExpressions as "candidates" for subexpressions, and then it's the codegen's job to figure out which subexpressions are actually needed and used and include the code for them? was (Author: kimahriman): Just spitballing after looking through some of the subexpression elimination code. Does it make more sense to treat the output of EquivalentExpressions as "candidates" for subexpressions, and then it's the codegen's job to figure out which subexpressions are actually needed and used and include the code for them? > Subexpression elimination enhancements > -- > > Key: SPARK-35448 > URL: https://issues.apache.org/jira/browse/SPARK-35448 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
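The "candidates" idea floated in the comment above can be sketched in plain Python. This is a hypothetical toy model, not Spark's EquivalentExpressions: expressions are nested tuples, a first pass merely counts structurally equal subtrees, and a second pass (standing in for codegen) keeps only the candidates actually reachable from the final output.

```python
# Toy model of the "candidates" idea (not Spark code): collect repeated
# subexpressions as candidates, then let a later stage filter to the ones
# actually used by the output.

from collections import Counter

def subexpressions(expr):
    # Expressions are nested tuples like ("+", ("*", "a", "b"), ("*", "a", "b")).
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subexpressions(child)

def candidates(expr):
    counts = Counter(subexpressions(expr))
    # Candidates: non-leaf expressions seen at least twice.
    return {e for e, n in counts.items() if n > 1 and isinstance(e, tuple)}

def used_in_output(cands, output_exprs):
    # The "codegen" stage keeps only candidates reachable from the output.
    reachable = set()
    for out in output_exprs:
        reachable.update(subexpressions(out))
    return cands & reachable

expr = ("+", ("*", "a", "b"), ("*", "a", "b"))
cands = candidates(expr)
print(cands)                          # {('*', 'a', 'b')}
print(used_in_output(cands, [expr]))  # {('*', 'a', 'b')}
```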
[jira] [Resolved] (SPARK-35438) Minor documentation fix for window physical operator
[ https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35438. -- Fix Version/s: 3.2.0 Assignee: Cheng Su Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32585 > Minor documentation fix for window physical operator > > > Key: SPARK-35438 > URL: https://issues.apache.org/jira/browse/SPARK-35438 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.2.0 > > > As title. Fixed two places where the documentation had errors, to help > people read the code more easily in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-35338) Separate arithmetic operations into data type based structures
[ https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reopened SPARK-35338: --- The PR was reverted at https://github.com/apache/spark/commit/d44e6c7f10528556dc5a64527e6f67e2ae7947fc due to a Python linter issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35338) Separate arithmetic operations into data type based structures
[ https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35338: Assignee: Apache Spark (was: Xinrong Meng) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35338) Separate arithmetic operations into data type based structures
[ https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-35338: -- Fix Version/s: (was: 3.2.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35338) Separate arithmetic operations into data type based structures
[ https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35338: Assignee: Xinrong Meng (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35449: Assignee: Apache Spark > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347986#comment-17347986 ] Apache Spark commented on SPARK-35449: -- User 'Kimahriman' has created a pull request for this issue: https://github.com/apache/spark/pull/32595 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
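The issue title above can be illustrated with a plain-Python model of CASE WHEN semantics (this is not Spark code; `udf` here is a stand-in for a value expression that might be extracted as "common"). With an empty elseValue, the value expressions only run when some predicate matches, so an expression common to the value branches is not safe to compute up front: on non-matching rows it would never have been evaluated at all.

```python
# Plain-Python model of the issue (not Spark code): in a CASE WHEN with no
# ELSE, value expressions are conditionally evaluated, so extracting a
# "common" expression out of the branches changes how often it runs.

evals = []

def udf(x):
    # Stand-in for a value expression we might be tempted to extract.
    evals.append(x)
    return x * 2

def case_when(x):
    # CASE WHEN x > 10 THEN udf(x) + 1 WHEN x > 5 THEN udf(x) - 1 END
    if x > 10:
        return udf(x) + 1
    if x > 5:
        return udf(x) - 1
    return None  # empty elseValue: no branch matched, udf never ran

results = [case_when(x) for x in [1, 2, 12, 7]]
print(results)  # [None, None, 25, 13]
print(evals)    # [12, 7] -- udf ran only for the matching rows
```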
[jira] [Assigned] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35449: Assignee: (was: Apache Spark) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35338) Separate arithmetic operations into data type based structures
[ https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347991#comment-17347991 ] Apache Spark commented on SPARK-35338: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/32596 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35446) Override getJDBCType in MySQLDialect to map FloatType to FLOAT
[ https://issues.apache.org/jira/browse/SPARK-35446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347993#comment-17347993 ] Takeshi Yamamuro commented on SPARK-35446: -- Oh, I was a bit surprised that mysql treats real as double... Anyway, the proposal sounds good. do you wanna open a PR to fix this? > Override getJDBCType in MySQLDialect to map FloatType to FLOAT > -- > > Key: SPARK-35446 > URL: https://issues.apache.org/jira/browse/SPARK-35446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Marios Meimaris >Priority: Major > > In > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L165,] > FloatType is mapped to REAL. However, MySQL treats REAL as a synonym to > DOUBLE by default (see > [https://dev.mysql.com/doc/refman/8.0/en/numeric-types.html] for more > information). > `MySQLDialect` could override `getJDBCType` so that it maps FloatType to > FLOAT instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
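The proposed mapping can be sketched as follows. This is a Python illustration only; the actual change would live in Scala's `MySQLDialect.getJDBCType`, and the dictionary and function names here (`DEFAULT_JDBC_TYPES`, `mysql_jdbc_type`) are assumptions for the sketch, not real Spark APIs.

```python
# Illustration of the proposed type mapping (the real fix would override
# getJDBCType in Scala's MySQLDialect); names here are illustrative only.

DEFAULT_JDBC_TYPES = {
    "FloatType": "REAL",    # current generic mapping in JdbcUtils
    "DoubleType": "DOUBLE",
}

def mysql_jdbc_type(spark_type):
    # MySQL treats REAL as a synonym for DOUBLE (unless REAL_AS_FLOAT is
    # set), so a MySQL-specific dialect should emit FLOAT for FloatType.
    if spark_type == "FloatType":
        return "FLOAT"
    return DEFAULT_JDBC_TYPES.get(spark_type)

print(mysql_jdbc_type("FloatType"))   # FLOAT
print(mysql_jdbc_type("DoubleType"))  # DOUBLE
```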
[jira] [Created] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.
Takuya Ueshin created SPARK-35450: - Summary: Follow checkout-merge way to use the latest commit for linter, or other workflows. Key: SPARK-35450 URL: https://issues.apache.org/jira/browse/SPARK-35450 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 3.2.0 Reporter: Takuya Ueshin For linter or other workflows besides build-and-tests, we should follow checkout-merge way to use the latest commit; otherwise, those could work on the old settings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35451) Sync with Apache Spark branch for all builds such as linter in GitHub Actions
Hyukjin Kwon created SPARK-35451: Summary: Sync with Apache Spark branch for all builds such as linter in GitHub Actions Key: SPARK-35451 URL: https://issues.apache.org/jira/browse/SPARK-35451 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.2.0 Reporter: Hyukjin Kwon We currently just check out https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L471-L472 for some build-only or linter jobs. This can easily cause logical conflicts when the PR is not synced against the latest changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35451) Sync with Apache Spark branch for all builds such as linter in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35451. -- Resolution: Duplicate > Sync with Apache Spark branch for all builds such as linter in GitHub Actions > - > > Key: SPARK-35451 > URL: https://issues.apache.org/jira/browse/SPARK-35451 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > We currently just check out > https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L471-L472 > for some build-only or linter jobs. > This can easily cause logical conflicts when the PR is not synced > against the latest changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.
[ https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35450: Assignee: Apache Spark > Follow checkout-merge way to use the latest commit for linter, or other > workflows. > -- > > Key: SPARK-35450 > URL: https://issues.apache.org/jira/browse/SPARK-35450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > For linter or other workflows besides build-and-tests, we should follow > checkout-merge way to use the latest commit; otherwise, those could work on > the old settings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.
[ https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35450: Assignee: (was: Apache Spark) > Follow checkout-merge way to use the latest commit for linter, or other > workflows. > -- > > Key: SPARK-35450 > URL: https://issues.apache.org/jira/browse/SPARK-35450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > For linter or other workflows besides build-and-tests, we should follow > checkout-merge way to use the latest commit; otherwise, those could work on > the old settings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.
[ https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347997#comment-17347997 ] Apache Spark commented on SPARK-35450: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/32597 > Follow checkout-merge way to use the latest commit for linter, or other > workflows. > -- > > Key: SPARK-35450 > URL: https://issues.apache.org/jira/browse/SPARK-35450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > For linter or other workflows besides build-and-tests, we should follow > checkout-merge way to use the latest commit; otherwise, those could work on > the old settings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35408) Improve parameter validation in DataFrame.show
[ https://issues.apache.org/jira/browse/SPARK-35408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348000#comment-17348000 ] Apache Spark commented on SPARK-35408: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32598 > Improve parameter validation in DataFrame.show > -- > > Key: SPARK-35408 > URL: https://issues.apache.org/jira/browse/SPARK-35408 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.1 >Reporter: Gera Shegalov >Assignee: Gera Shegalov >Priority: Major > Fix For: 3.2.0 > > > Being more used to Scala API the user may be tempted to call > {code:python} > df.show(False) > {code} > in PySparkl and she will receive an error message that does not easily map to > the user code > {noformat} > py4j.Py4JException: Method showString([class java.lang.Boolean, class > java.lang.Integer, class java.lang.Boolean]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
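The improvement is essentially client-side argument validation, so the mistake surfaces as a Python error instead of an opaque py4j `showString` lookup failure. A minimal sketch of that kind of check (a hypothetical standalone helper, not the actual pyspark code):

```python
def validate_show_args(n=20, truncate=True, vertical=False):
    """Validate DataFrame.show()-style arguments before calling into the JVM."""
    # A bool passed positionally as `n` (e.g. df.show(False)) is the classic
    # mistake carried over from the Scala API, where show(false) is valid.
    # Note bool is a subclass of int in Python, so it must be rejected first.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError(
            "Parameter 'n' (number of rows) must be an int, got '%s'"
            % type(n).__name__)
    if not isinstance(truncate, (bool, int)):
        raise TypeError(
            "Parameter 'truncate' must be a bool or int, got '%s'"
            % type(truncate).__name__)
    if not isinstance(vertical, bool):
        raise TypeError(
            "Parameter 'vertical' must be a bool, got '%s'"
            % type(vertical).__name__)
    return n, truncate, vertical

# With a check like this, df.show(False) raises a TypeError that names the
# offending parameter instead of the py4j reflection error in the report.
```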
[jira] [Assigned] (SPARK-35320) from_json cannot parse maps with timestamp as key
[ https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35320: Assignee: Apache Spark > from_json cannot parse maps with timestamp as key > - > > Key: SPARK-35320 > URL: https://issues.apache.org/jira/browse/SPARK-35320 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.1, 3.1.1 > Environment: * Java 11 > * Spark 3.0.1/3.1.1 > * Scala 2.12 > Reporter: Vincenzo Cerminara > Assignee: Apache Spark > Priority: Minor > > I have a json that contains a {{map}} like the following
> {code:json}
> {
>   "map": {
>     "2021-05-05T20:05:08": "sampleValue"
>   }
> }
> {code}
> The key of the map is a string containing a formatted timestamp, and I want to parse it as a Java Map<Instant, String> using the from_json Spark SQL function (see the {{Sample}} class in the code below).
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> import java.io.Serializable;
> import java.time.Instant;
> import java.util.List;
> import java.util.Map;
>
> import static org.apache.spark.sql.functions.*;
>
> public class TimestampAsJsonMapKey {
>
>     public static class Sample implements Serializable {
>         private Map<Instant, String> map;
>
>         public Map<Instant, String> getMap() {
>             return map;
>         }
>
>         public void setMap(Map<Instant, String> map) {
>             this.map = map;
>         }
>     }
>
>     public static class InvertedSample implements Serializable {
>         private Map<String, Instant> map;
>
>         public Map<String, Instant> getMap() {
>             return map;
>         }
>
>         public void setMap(Map<String, Instant> map) {
>             this.map = map;
>         }
>     }
>
>     public static void main(String[] args) {
>         final SparkSession spark = SparkSession
>                 .builder()
>                 .appName("Timestamp As Json Map Key Test")
>                 .master("local[1]")
>                 .getOrCreate();
>
>         workingTest(spark);
>         notWorkingTest(spark);
>     }
>
>     private static void workingTest(SparkSession spark) {
>         //language=JSON
>         final String invertedSampleJson = "{ \"map\": { \"sampleValue\": \"2021-05-05T20:05:08\" } }";
>
>         final Dataset<String> samplesDf = spark.createDataset(List.of(invertedSampleJson), Encoders.STRING());
>         final Dataset<Row> parsedDf = samplesDf.select(from_json(col("value"), Encoders.bean(InvertedSample.class).schema()));
>         parsedDf.show(false);
>     }
>
>     private static void notWorkingTest(SparkSession spark) {
>         //language=JSON
>         final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": \"sampleValue\" } }";
>
>         final Dataset<String> samplesDf = spark.createDataset(List.of(sampleJson), Encoders.STRING());
>         final Dataset<Row> parsedDf = samplesDf.select(from_json(col("value"), Encoders.bean(Sample.class).schema()));
>         parsedDf.show(false);
>     }
> }
> {code}
> When I run the {{notWorkingTest}} method it fails with the following exception:
> {noformat}
> Exception in thread "main" java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
> at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
> at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
> at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
> at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329)
> at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
> at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321)
> at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359)
> at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
> at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352)
> at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815)
> at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
> at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
> at org.apache.spark.sql.catalys
[jira] [Assigned] (SPARK-35320) from_json cannot parse maps with timestamp as key
[ https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35320: Assignee: (was: Apache Spark) > from_json cannot parse maps with timestamp as key > - > > Key: SPARK-35320 > URL: https://issues.apache.org/jira/browse/SPARK-35320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.1 > Environment: * Java 11 > * Spark 3.0.1/3.1.1 > * Scala 2.12 >Reporter: Vincenzo Cerminara >Priority: Minor > > I have a json that contains a {{map}} like the following > {code:json} > { > "map": { > "2021-05-05T20:05:08": "sampleValue" > } > } > {code} > The key of the map is a string containing a formatted timestamp and I want to > parse it as a Java Map using the from_json > Spark SQL function (see the {{Sample}} class in the code below). > {code:java} > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Encoders; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SparkSession; > import java.io.Serializable; > import java.time.Instant; > import java.util.List; > import java.util.Map; > import static org.apache.spark.sql.functions.*; > public class TimestampAsJsonMapKey { > public static class Sample implements Serializable { > private Map map; > > public Map getMap() { > return map; > } > > public void setMap(Map map) { > this.map = map; > } > } > public static class InvertedSample implements Serializable { > private Map map; > > public Map getMap() { > return map; > } > > public void setMap(Map map) { > this.map = map; > } > } > public static void main(String[] args) { > final SparkSession spark = SparkSession > .builder() > .appName("Timestamp As Json Map Key Test") > .master("local[1]") > .getOrCreate(); > workingTest(spark); > notWorkingTest(spark); > } > private static void workingTest(SparkSession spark) { > //language=JSON > final String invertedSampleJson = "{ \"map\": { \"sampleValue\": > \"2021-05-05T20:05:08\" } }"; > 
final Dataset samplesDf = > spark.createDataset(List.of(invertedSampleJson), Encoders.STRING()); > final Dataset parsedDf = > samplesDf.select(from_json(col("value"), > Encoders.bean(InvertedSample.class).schema())); > parsedDf.show(false); > } > private static void notWorkingTest(SparkSession spark) { > //language=JSON > final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": > \"sampleValue\" } }"; > final Dataset samplesDf = > spark.createDataset(List.of(sampleJson), Encoders.STRING()); > final Dataset parsedDf = > samplesDf.select(from_json(col("value"), > Encoders.bean(Sample.class).schema())); > parsedDf.show(false); > } > } > {code} > When I run the {{notWorkingTest}} method it fails with the following > exception: > {noformat} > Exception in thread "main" java.lang.ClassCastException: class > org.apache.spark.unsafe.types.UTF8String cannot be cast to class > java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module > of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap') > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352) > at > 
org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156) > at > org.apache.spark.sql.catalyst.expressions.Interpreted
[jira] [Assigned] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.
[ https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-35450: Assignee: Takuya Ueshin > Follow checkout-merge way to use the latest commit for linter, or other > workflows. > -- > > Key: SPARK-35450 > URL: https://issues.apache.org/jira/browse/SPARK-35450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > For linter or other workflows besides build-and-tests, we should follow > checkout-merge way to use the latest commit; otherwise, those could work on > the old settings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35320) from_json cannot parse maps with timestamp as key
[ https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348002#comment-17348002 ] Apache Spark commented on SPARK-35320: -- User 'planga82' has created a pull request for this issue: https://github.com/apache/spark/pull/32599 > from_json cannot parse maps with timestamp as key > - > > Key: SPARK-35320 > URL: https://issues.apache.org/jira/browse/SPARK-35320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.1 > Environment: * Java 11 > * Spark 3.0.1/3.1.1 > * Scala 2.12 >Reporter: Vincenzo Cerminara >Priority: Minor > > I have a json that contains a {{map}} like the following > {code:json} > { > "map": { > "2021-05-05T20:05:08": "sampleValue" > } > } > {code} > The key of the map is a string containing a formatted timestamp and I want to > parse it as a Java Map using the from_json > Spark SQL function (see the {{Sample}} class in the code below). > {code:java} > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Encoders; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SparkSession; > import java.io.Serializable; > import java.time.Instant; > import java.util.List; > import java.util.Map; > import static org.apache.spark.sql.functions.*; > public class TimestampAsJsonMapKey { > public static class Sample implements Serializable { > private Map map; > > public Map getMap() { > return map; > } > > public void setMap(Map map) { > this.map = map; > } > } > public static class InvertedSample implements Serializable { > private Map map; > > public Map getMap() { > return map; > } > > public void setMap(Map map) { > this.map = map; > } > } > public static void main(String[] args) { > final SparkSession spark = SparkSession > .builder() > .appName("Timestamp As Json Map Key Test") > .master("local[1]") > .getOrCreate(); > workingTest(spark); > notWorkingTest(spark); > } > private static void workingTest(SparkSession spark) { 
> //language=JSON > final String invertedSampleJson = "{ \"map\": { \"sampleValue\": > \"2021-05-05T20:05:08\" } }"; > final Dataset samplesDf = > spark.createDataset(List.of(invertedSampleJson), Encoders.STRING()); > final Dataset parsedDf = > samplesDf.select(from_json(col("value"), > Encoders.bean(InvertedSample.class).schema())); > parsedDf.show(false); > } > private static void notWorkingTest(SparkSession spark) { > //language=JSON > final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": > \"sampleValue\" } }"; > final Dataset samplesDf = > spark.createDataset(List.of(sampleJson), Encoders.STRING()); > final Dataset parsedDf = > samplesDf.select(from_json(col("value"), > Encoders.bean(Sample.class).schema())); > parsedDf.show(false); > } > } > {code} > When I run the {{notWorkingTest}} method it fails with the following > exception: > {noformat} > Exception in thread "main" java.lang.ClassCastException: class > org.apache.spark.unsafe.types.UTF8String cannot be cast to class > java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module > of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap') > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352) > at > org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.exp
[jira] [Resolved] (SPARK-35450) Follow checkout-merge way to use the latest commit for linter, or other workflows.
[ https://issues.apache.org/jira/browse/SPARK-35450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35450. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32597 [https://github.com/apache/spark/pull/32597] > Follow checkout-merge way to use the latest commit for linter, or other > workflows. > -- > > Key: SPARK-35450 > URL: https://issues.apache.org/jira/browse/SPARK-35450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > > For linter or other workflows besides build-and-tests, we should follow > checkout-merge way to use the latest commit; otherwise, those could work on > the old settings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35452) Introduce ComplexTypeOps
Xinrong Meng created SPARK-35452: Summary: Introduce ComplexTypeOps Key: SPARK-35452 URL: https://issues.apache.org/jira/browse/SPARK-35452 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng StructType, ArrayType, MapType, and BinaryType are not accepted by DataTypeOps now. We should handle these complex types. Please note that arithmetic operations might be applied to these complex types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
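The expected semantics for the list example above can be checked directly in pandas, which is the behavior `ps.Series` is meant to mirror (requires only pandas):

```python
import pandas as pd

# pandas applies '+' elementwise on object-dtype series; for list elements
# this is list concatenation, which is what ps.Series should reproduce.
result = pd.Series([[1, 2, 3]]) + pd.Series([[4, 5, 6]])
print(result)
# 0    [1, 2, 3, 4, 5, 6]
# dtype: object
```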
[jira] [Created] (SPARK-35453) Move Koalas accessor to pandas_on_spark accessor
Haejoon Lee created SPARK-35453: --- Summary: Move Koalas accessor to pandas_on_spark accessor Key: SPARK-35453 URL: https://issues.apache.org/jira/browse/SPARK-35453 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Haejoon Lee The existing Koalas has the "Koalas accessor", which is named after the Koalas project. We should rename this accessor to "Pandas on Spark accessor". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35453) Move Koalas accessor to pandas_on_spark accessor
[ https://issues.apache.org/jira/browse/SPARK-35453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348021#comment-17348021 ] Haejoon Lee commented on SPARK-35453: - I'm working on this. > Move Koalas accessor to pandas_on_spark accessor > > > Key: SPARK-35453 > URL: https://issues.apache.org/jira/browse/SPARK-35453 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > The existing Koalas has the "Koalas accessor", which is named after the > Koalas project. > > We should rename this accessor to "Pandas on Spark accessor". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35338) Separate arithmetic operations into data type based structures
[ https://issues.apache.org/jira/browse/SPARK-35338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-35338. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32596 https://github.com/apache/spark/pull/32596 > Separate arithmetic operations into data type based structures > -- > > Key: SPARK-35338 > URL: https://issues.apache.org/jira/browse/SPARK-35338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > For arithmetic operations: > {code:java} > __add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__, > __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__ > {code} > we would like to separate them into data-type-based structures. > The existing behaviors of each arithmetic operation should be preserved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33428) conv UDF returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-33428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348024#comment-17348024 ] Nguyen Quang Huy commented on SPARK-33428: -- Should we consider returning a null value when overflow occurs, since BigInt can't handle both signed and unsigned bases? > conv UDF returns incorrect value > > > Key: SPARK-33428 > URL: https://issues.apache.org/jira/browse/SPARK-33428 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {noformat} > spark-sql> select java_method('scala.math.BigInt', 'apply', > 'c8dcdfb41711fc9a1f17928001d7fd61', 16); > 266992441711411603393340504520074460513 > spark-sql> select conv('c8dcdfb41711fc9a1f17928001d7fd61', 16, 10); > 18446744073709551615 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
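The overflow is easy to see with arbitrary-precision arithmetic: the hex string is 32 digits (128 bits), while `conv` accumulates into an unsigned 64-bit value and saturates at its maximum, which is exactly the second result in the reproduction above:

```python
# Python ints are arbitrary-precision, matching the BigInt result:
v = int('c8dcdfb41711fc9a1f17928001d7fd61', 16)
print(v)              # 266992441711411603393340504520074460513 (exact value)

# conv's answer is the unsigned 64-bit maximum, i.e. saturation on overflow:
print(2**64 - 1)      # 18446744073709551615
print(v > 2**64 - 1)  # True: the input does not fit in an unsigned 64-bit long
```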
[jira] [Created] (SPARK-35454) Ambiguous self-join doesn't fail after transforming the dataset to dataframe
wuyi created SPARK-35454: Summary: Ambiguous self-join doesn't fail after transforming the dataset to dataframe Key: SPARK-35454 URL: https://issues.apache.org/jira/browse/SPARK-35454 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.1.0 Reporter: wuyi {code:java} test("SPARK-28344: fail ambiguous self join - Dataset.colRegex as column ref") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1.colRegex("id") > df2.colRegex("id"))) } } {code} For this unit test, if we append `.toDF()` to both df1 and df2, the query won't fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348033#comment-17348033 ] Yuzhou Sun commented on SPARK-35106: Thanks Erik and Wenchen for initializing and reviewing the fix! > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. 
When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. > We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35455) Enhance EliminateUnnecessaryJoin
XiDuo You created SPARK-35455: - Summary: Enhance EliminateUnnecessaryJoin Key: SPARK-35455 URL: https://issues.apache.org/jira/browse/SPARK-35455 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: XiDuo You Make EliminateUnnecessaryJoin support eliminating outer joins and multi-joins. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35456) Show invalid value in config entry check error message
Kent Yao created SPARK-35456: Summary: Show invalid value in config entry check error message Key: SPARK-35456 URL: https://issues.apache.org/jira/browse/SPARK-35456 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35443: - Assignee: Ashray Jain > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Ashray Jain >Assignee: Ashray Jain >Priority: Minor > Labels: kubernetes > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35443. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32588 [https://github.com/apache/spark/pull/32588] > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Ashray Jain >Assignee: Ashray Jain >Priority: Minor > Labels: kubernetes > Fix For: 3.2.0 > > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
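The {{immutable}} field referenced above is a plain field on the ConfigMap/Secret API objects, so the change amounts to setting it when the manifest is built. A minimal sketch expressed as a serializable dict (the ConfigMap name and data keys are hypothetical examples):

```python
# Manifest for an immutable ConfigMap, as a plain dict so it can be
# serialized with any Kubernetes client. The name and data are hypothetical.
import json

configmap = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "spark-exec-conf"},
    "data": {"spark.executor.memory": "4g"},
    # The field the issue proposes setting (what .withImmutable(true) emits
    # in the fabric8 builder): once true, the data can no longer be updated
    # and the kube-apiserver closes watches on the object.
    "immutable": True,
}

payload = json.dumps(configmap)
```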
[jira] [Resolved] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches
[ https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27991. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32287 [https://github.com/apache/spark/pull/32287] > ShuffleBlockFetcherIterator should take Netty constant-factor overheads into > account when limiting number of simultaneous block fetches > --- > > Key: SPARK-27991 > URL: https://issues.apache.org/jira/browse/SPARK-27991 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > Fix For: 3.2.0 > > > ShuffleBlockFetcherIterator has logic to limit the number of simultaneous > block fetches. By default, this logic tries to keep the number of outstanding > block fetches [beneath a data size > limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274] > ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads > into account: even though a remote block might be, say, 4KB, there are > certain fixed-size internal overheads due to Netty buffer sizes which may > cause the actual space requirements to be larger. > As a result, if a map stage produces a huge number of extremely tiny blocks > then we may see errors like > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485) > [...] 
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640) > at > io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594) > at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) > [...]{code} > SPARK-24989 is another report of this problem (but with a different proposed > fix). > This problem can currently be mitigated by setting > {{spark.reducer.maxReqsInFlight}} to some some non-IntMax value (SPARK-6166), > but this additional manual configuration step is cumbersome. > Instead, I think that Spark should take these fixed overheads into account in > the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, > use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some > tricky details involved to make this work on all configurations (e.g. to use > a different minimum when direct buffers are disabled, etc.), but I think the > core idea behind the fix is pretty simple. > This will improve Spark's stability and removes configuration / tuning burden > from end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
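The overhead-aware accounting the issue proposes can be sketched as follows: charge each block the larger of its actual size and the fixed per-block buffer cost when deciding how many fetches fit under {{maxBytesInFlight}}. (The quoted text writes {{Math.min}}, but taking the larger of the two values is what matches the stated goal of reserving at least one buffer per tiny block.) All names and sizes below are illustrative, not Spark's actual configuration values:

```python
# Sketch of overhead-aware "bytes in flight" accounting: each block is
# charged at least one fixed-size buffer. Numbers are illustrative only.
MIN_BUFFER_BYTES = 16 * 1024      # assumed fixed per-block Netty overhead
MAX_BYTES_IN_FLIGHT = 64 * 1024

def effective_size(block_size):
    # Charge the larger of the block's size and the per-block buffer cost.
    return max(block_size, MIN_BUFFER_BYTES)

def batch_fetches(block_sizes):
    """Greedily group blocks so each batch stays under the in-flight limit."""
    batches, current, used = [], [], 0
    for size in block_sizes:
        cost = effective_size(size)
        if current and used + cost > MAX_BYTES_IN_FLIGHT:
            batches.append(current)
            current, used = [], 0
        current.append(size)
        used += cost
    if current:
        batches.append(current)
    return batches

# Eight tiny 4KB blocks: naive accounting would fetch all of them at once
# (8 * 4KB = 32KB < 64KB), but overhead-aware accounting caps each batch
# at four blocks (4 * 16KB = 64KB), keeping direct memory use bounded.
batches = batch_fetches([4 * 1024] * 8)
```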
[jira] [Updated] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35443: -- Labels: (was: kubernetes) > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Ashray Jain >Assignee: Ashray Jain >Priority: Minor > Fix For: 3.2.0 > > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35443) Mark K8s secrets and config maps as immutable
[ https://issues.apache.org/jira/browse/SPARK-35443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35443: -- Affects Version/s: (was: 3.1.1) 3.2.0 > Mark K8s secrets and config maps as immutable > - > > Key: SPARK-35443 > URL: https://issues.apache.org/jira/browse/SPARK-35443 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Ashray Jain >Assignee: Ashray Jain >Priority: Minor > Labels: kubernetes > Fix For: 3.2.0 > > > Kubernetes supports marking secrets and config maps as immutable to gain > performance. > [https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable] > [https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable] > For K8s clusters that run many thousands of Spark applications, this can > yield significant reduction in load on the kube-apiserver. > From the K8s docs: > {quote}For clusters that extensively use Secrets (at least tens of thousands > of unique Secret to Pod mounts), preventing changes to their data has the > following advantages: > * protects you from accidental (or unwanted) updates that could cause > applications outages > * improves performance of your cluster by significantly reducing load on > kube-apiserver, by closing watches for secrets marked as immutable.{quote} > > For any secrets and config maps we create in Spark that are immutable, we > could mark them as immutable by including the following when building the > secret/config map > {code:java} > .withImmutable(true) > {code} > This feature has been supported in K8s as beta since K8s 1.19 and as GA since > K8s 1.21 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches
[ https://issues.apache.org/jira/browse/SPARK-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27991: --- Assignee: wuyi > ShuffleBlockFetcherIterator should take Netty constant-factor overheads into > account when limiting number of simultaneous block fetches > --- > > Key: SPARK-27991 > URL: https://issues.apache.org/jira/browse/SPARK-27991 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: wuyi >Priority: Major > Fix For: 3.2.0 > > > ShuffleBlockFetcherIterator has logic to limit the number of simultaneous > block fetches. By default, this logic tries to keep the number of outstanding > block fetches [beneath a data size > limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274] > ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads > into account: even though a remote block might be, say, 4KB, there are > certain fixed-size internal overheads due to Netty buffer sizes which may > cause the actual space requirements to be larger. > As a result, if a map stage produces a huge number of extremely tiny blocks > then we may see errors like > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485) > [...] 
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304) > at > io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640) > at > io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594) > at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) > [...]{code} > SPARK-24989 is another report of this problem (but with a different proposed > fix). > This problem can currently be mitigated by setting > {{spark.reducer.maxReqsInFlight}} to some some non-IntMax value (SPARK-6166), > but this additional manual configuration step is cumbersome. > Instead, I think that Spark should take these fixed overheads into account in > the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, > use {{Math.min(blockSize, minimumNettyBufferSize)}}. There might be some > tricky details involved to make this work on all configurations (e.g. to use > a different minimum when direct buffers are disabled, etc.), but I think the > core idea behind the fix is pretty simple. > This will improve Spark's stability and removes configuration / tuning burden > from end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35106: --- Assignee: (was: Erik Krogen) > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. 
the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. > We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
[ https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35106: --- Assignee: Yuzhou Sun > HadoopMapReduceCommitProtocol performs bad rename when dynamic partition > overwrite is used > -- > > Key: SPARK-35106 > URL: https://issues.apache.org/jira/browse/SPARK-35106 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Yuzhou Sun >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > Recently when evaluating the code in > {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under > the {{dynamicPartitionOverwrite == true}} scenario: > {code:language=scala} > # BLOCK 1 > if (dynamicPartitionOverwrite) { > val absPartitionPaths = filesToMove.values.map(new > Path(_).getParent).toSet > logDebug(s"Clean up absolute partition directories for overwriting: > $absPartitionPaths") > absPartitionPaths.foreach(fs.delete(_, true)) > } > # BLOCK 2 > for ((src, dst) <- filesToMove) { > fs.rename(new Path(src), new Path(dst)) > } > # BLOCK 3 > if (dynamicPartitionOverwrite) { > val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _) > logDebug(s"Clean up default partition directories for overwriting: > $partitionPaths") > for (part <- partitionPaths) { > val finalPartPath = new Path(path, part) > if (!fs.delete(finalPartPath, true) && > !fs.exists(finalPartPath.getParent)) { > // According to the official hadoop FileSystem API spec, delete > op should assume > // the destination is no longer present regardless of return > value, thus we do not > // need to double check if finalPartPath exists before rename. > // Also in our case, based on the spec, delete returns false only > when finalPartPath > // does not exist. When this happens, we need to take action if > parent of finalPartPath > // also does not exist(e.g. 
the scenario described on > SPARK-23815), because > // FileSystem API spec on rename op says the rename > dest(finalPartPath) must have > // a parent that exists, otherwise we may get unexpected result > on the rename. > fs.mkdirs(finalPartPath.getParent) > } > fs.rename(new Path(stagingDir, part), finalPartPath) > } > } > {code} > Assuming {{dynamicPartitionOverwrite == true}}, we have the following > sequence of events: > # Block 1 deletes all parent directories of {{filesToMove.values}} > # Block 2 attempts to rename all {{filesToMove.keys}} to > {{filesToMove.values}} > # Block 3 does directory-level renames to place files into their final > locations > All renames in Block 2 will always fail, since all parent directories of > {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS > scenario, the contract of {{fs.rename}} is to return {{false}} under such a > failure scenario, as opposed to throwing an exception. There is a separate > issue here that Block 2 should probably be checking for those {{false}} > return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", > albeit with a bunch of failed renames in the middle. Really, we should only > run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and > consolidate Blocks 1 and 3 to run in the {{true}} case. > We discovered this issue when testing against a {{FileSystem}} implementation > which was throwing an exception for this failed rename scenario instead of > returning false, escalating the silent/ignored rename failures into actual > failures. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35456) Show invalid value in config entry check error message
[ https://issues.apache.org/jira/browse/SPARK-35456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35456: Assignee: (was: Apache Spark) > Show invalid value in config entry check error message > -- > > Key: SPARK-35456 > URL: https://issues.apache.org/jira/browse/SPARK-35456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35456) Show invalid value in config entry check error message
[ https://issues.apache.org/jira/browse/SPARK-35456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348060#comment-17348060 ] Apache Spark commented on SPARK-35456: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/32600 > Show invalid value in config entry check error message > -- > > Key: SPARK-35456 > URL: https://issues.apache.org/jira/browse/SPARK-35456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
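The improvement asks that a failed config value check report the offending value (and key) rather than only a generic "requirement failed" message. A hypothetical sketch of that behavior, with invented names since the issue does not specify the final message format:

```python
# Hypothetical sketch of a config entry whose value check reports the
# rejected value in its error message, as SPARK-35456 proposes.
class ConfigEntry:
    def __init__(self, key, check, error_hint):
        self.key = key
        self.check = check
        self.error_hint = error_hint

    def parse(self, raw):
        value = int(raw)
        if not self.check(value):
            # Include the key and the invalid value, not just a generic
            # "requirement failed" message.
            raise ValueError(
                f"'{raw}' in {self.key} is invalid. {self.error_hint}")
        return value

entry = ConfigEntry("spark.example.buffer.size",
                    check=lambda v: v > 0,
                    error_hint="The buffer size must be positive.")

try:
    entry.parse("-1")
    message = None
except ValueError as e:
    message = str(e)
```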
[jira] [Assigned] (SPARK-35456) Show invalid value in config entry check error message
[ https://issues.apache.org/jira/browse/SPARK-35456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35456: Assignee: Apache Spark > Show invalid value in config entry check error message > -- > > Key: SPARK-35456 > URL: https://issues.apache.org/jira/browse/SPARK-35456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35130) Construct day-time interval column from integral fields
[ https://issues.apache.org/jira/browse/SPARK-35130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348064#comment-17348064 ] Apache Spark commented on SPARK-35130: -- User 'copperybean' has created a pull request for this issue: https://github.com/apache/spark/pull/32601 > Construct day-time interval column from integral fields > --- > > Key: SPARK-35130 > URL: https://issues.apache.org/jira/browse/SPARK-35130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Create new function similar to make_interval() (or extend the make_interval() > function) which can construct DayTimeIntervalType values from day, hour, > minute, second fields. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
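The requested semantics amount to assembling a duration from integral day/hour/minute/second fields. A sketch using `datetime.timedelta` as a stand-in for {{DayTimeIntervalType}} (the function name `make_dt_interval` is hypothetical here; the issue leaves the final name open between a new function and an extended make_interval()):

```python
# Sketch of the requested semantics: construct a day-time interval value
# from integral day, hour, minute, and second fields. The name
# make_dt_interval is hypothetical; timedelta stands in for the SQL type.
from datetime import timedelta

def make_dt_interval(days=0, hours=0, mins=0, secs=0):
    return timedelta(days=days, hours=hours, minutes=mins, seconds=secs)

interval = make_dt_interval(days=1, hours=2, mins=30, secs=15)
total_seconds = int(interval.total_seconds())
```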