[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872075#comment-16872075 ] Luca Canali commented on SPARK-28091: - Thank you [~ste...@apache.org] for your comment and clarifications. Indeed what I am trying to do is to collect executor-level metrics for S3A (and also for other Hadoop compatible filesystems of interest). The goal is to bring the metrics into the Spark metrics system, so that they can be used, for example, in a performance dashboard and displayed together with the rest of the instrumentation metrics. The original work for this started from the need to measure I/O metrics for a custom HDFS-compatible filesystem that we use (called ROOT:) and more recently also for S3A. The first implementation we did was simple: a small change in [[ExecutorSource]], which already has code to collect metrics for "hdfs" and "file"/local filesystems at the executor level. That code is obviously easy to extend, but going that way feels like a short-term hack. My thought with this PR is to provide a flexible method to add instrumentation, starting from our current use case related to I/O workload monitoring, but also open to several other use cases. I am also quite interested to see developments in this area for CPU counters and possibly also GPU-related instrumentation. I think the proposal to use executor plugins for this goes in the original direction outlined by [~irashid] and collaborators with SPARK-24918. I add some links to reference and related material: code of a few test executor metrics plugins that I am developing: [https://github.com/cerndb/SparkExecutorPlugins] The general idea of how to build a dashboard with Spark metrics is described in [https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark] > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics system implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark jobs. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. Rather than extending the code > there, we propose to add a metrics plugin system, which is more flexible and > of more general use. > Advantages: > * The metric plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop. Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was > introduced in Hadoop 2.8). 
Using a metric plugin for Spark would allow an > easy way to “opt in” to using such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc.) that might not > normally be considered general enough for inclusion in Apache Spark code, > but that can nevertheless be useful for specialized use cases, tests or > troubleshooting. > Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. It is based on the work on Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228. > Tests and examples: > So far this has been manually tested running Spark on YARN and K8S clusters.
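For illustration, below is a minimal sketch of what such an executor metrics plugin could look like, assuming the ExecutorPlugin interface introduced by SPARK-24918 and the Dropwizard/Codahale MetricRegistry. How the plugin's registry would be wired into the Spark metrics system is precisely what this proposal has to define, so that part is omitted; the class name and metric names are hypothetical.
{code:scala}
// Hypothetical sketch: an executor plugin exposing a Dropwizard gauge for
// S3A bytes read, built on the (deprecated) FileSystem.getAllStatistics API.
import scala.collection.JavaConverters._
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.ExecutorPlugin

class S3AMetricsPlugin extends ExecutorPlugin {
  // Registering this registry with the executor's metrics system is the
  // missing hook that SPARK-28091 proposes to add.
  val registry = new MetricRegistry

  override def init(): Unit = {
    registry.register(MetricRegistry.name("filesystem", "s3a", "bytesRead"),
      new Gauge[Long] {
        override def getValue: Long =
          FileSystem.getAllStatistics.asScala
            .filter(_.getScheme == "s3a")
            .map(_.getBytesRead)
            .sum
      })
  }

  override def shutdown(): Unit = {}
}
{code}
On Hadoop 2.8+ the same gauge could instead read from getGlobalStorageStatistics, as discussed in the description.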
[jira] [Assigned] (SPARK-28149) Disable negative DNS caching
[ https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28149: Assignee: (was: Apache Spark) > Disable negative DNS caching > - > > Key: SPARK-28149 > URL: https://issues.apache.org/jira/browse/SPARK-28149 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Priority: Minor > > By default the JVM caches failed DNS resolutions for 10 seconds. > The Alpine JDK used in the Kubernetes images has a default timeout of 5 > seconds. > This means that in clusters with slow init time (network sidecar pods, slow > network start up) executors will never run, because the first attempt to > connect to the driver will fail, and that failure will be cached, causing > the retries to happen in a tight loop without actually trying again. > > The proposed implementation would be to change entrypoint.sh (which is > exclusive to k8s) so that it alters the file with the DNS caching settings, > disabling negative caching if an environment variable such as > "DISABLE_DNS_NEGATIVE_CACHING" is defined. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
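For context, the JVM-level setting that the proposed entrypoint.sh change would effectively toggle is the networkaddress.cache.negative.ttl security property. A minimal sketch, assuming it runs before the first DNS lookup is attempted (the actual proposal edits the java.security file from entrypoint.sh rather than calling this API):
{code:scala}
// Illustration only: disable negative DNS caching at the JVM level.
// A TTL of 0 stops failed lookups from being cached, so a driver that is not
// yet resolvable will be retried for real on each attempt.
import java.security.Security

object DisableNegativeDnsCache {
  def main(args: Array[String]): Unit = {
    // Environment variable name taken from the issue description above.
    if (sys.env.contains("DISABLE_DNS_NEGATIVE_CACHING")) {
      Security.setProperty("networkaddress.cache.negative.ttl", "0")
    }
  }
}
{code}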
[jira] [Assigned] (SPARK-28149) Disable negative DNS caching
[ https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28149: Assignee: Apache Spark > Disable negative DNS caching > - > > Key: SPARK-28149 > URL: https://issues.apache.org/jira/browse/SPARK-28149 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Assignee: Apache Spark >Priority: Minor > > By default the JVM caches failed DNS resolutions for 10 seconds. > The Alpine JDK used in the Kubernetes images has a default timeout of 5 > seconds. > This means that in clusters with slow init time (network sidecar pods, slow > network start up) executors will never run, because the first attempt to > connect to the driver will fail, and that failure will be cached, causing > the retries to happen in a tight loop without actually trying again. > > The proposed implementation would be to change entrypoint.sh (which is > exclusive to k8s) so that it alters the file with the DNS caching settings, > disabling negative caching if an environment variable such as > "DISABLE_DNS_NEGATIVE_CACHING" is defined. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type
[ https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28158: Assignee: (was: Apache Spark) > Hive UDFs supports UDT type > --- > > Key: SPARK-28158 > URL: https://issues.apache.org/jira/browse/SPARK-28158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type
[ https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28158: Assignee: Apache Spark > Hive UDFs supports UDT type > --- > > Key: SPARK-28158 > URL: https://issues.apache.org/jira/browse/SPARK-28158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Genmao Yu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28158) Hive UDFs supports UDT type
Genmao Yu created SPARK-28158: - Summary: Hive UDFs supports UDT type Key: SPARK-28158 URL: https://issues.apache.org/jira/browse/SPARK-28158 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3, 3.0.0 Reporter: Genmao Yu -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28157. --- Resolution: Invalid My bad. This issue is invalid. > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see this log files back after the permissions are > recovered correctly. The only way has been restarting SHS. > Apache Spark is unable to know the permission recovery. However, we had > better give a second chances for those blacklisted files in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872050#comment-16872050 ] Yuming Wang commented on SPARK-28036: - {code:sql} [root@spark-3267648 spark-3.0.0-SNAPSHOT-bin-3.2.0]# bin/spark-shell 19/06/24 23:10:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://spark-3267648.lvs02.dev.ebayc3.com:4040 Spark context available as 'sc' (master = local[*], app id = local-1561443018277). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("select left('ahoj', -2), right('ahoj', -2)").show ++-+ |left('ahoj', -2)|right('ahoj', -2)| ++-+ || | ++-+ scala> spark.sql("select left('ahoj', 2), right('ahoj', 2)").show +---++ |left('ahoj', 2)|right('ahoj', 2)| +---++ | ah| oj| +---++ {code} > Built-in udf left/right has inconsistent behavior > - > > Key: SPARK-28036 > URL: https://issues.apache.org/jira/browse/SPARK-28036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {code:sql} > postgres=# select left('ahoj', -2), right('ahoj', -2); > left | right > --+--- > ah | oj > (1 row) > {code} > Spark SQL: > {code:sql} > spark-sql> select left('ahoj', -2), right('ahoj', -2); > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872045#comment-16872045 ] Shivu Sondur commented on SPARK-28036: -- [~yumwang] select left('ahoj', 2), right('ahoj', 2); use with '-' sign, it will work fine, i tested in latest spark > Built-in udf left/right has inconsistent behavior > - > > Key: SPARK-28036 > URL: https://issues.apache.org/jira/browse/SPARK-28036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {code:sql} > postgres=# select left('ahoj', -2), right('ahoj', -2); > left | right > --+--- > ah | oj > (1 row) > {code} > Spark SQL: > {code:sql} > spark-sql> select left('ahoj', -2), right('ahoj', -2); > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
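For reference, the PostgreSQL semantics quoted in the description can be summarized as: left(s, -n) drops the last n characters and right(s, -n) drops the first n. A small sketch in plain Scala illustrating the expected behaviour (this is not Spark's implementation):
{code:scala}
// Expected PostgreSQL-compatible semantics for negative lengths.
def pgLeft(s: String, n: Int): String =
  if (n >= 0) s.take(n) else s.dropRight(-n)

def pgRight(s: String, n: Int): String =
  if (n >= 0) s.takeRight(n) else s.drop(-n)

assert(pgLeft("ahoj", -2) == "ah")   // postgres: left('ahoj', -2)  -> 'ah'
assert(pgRight("ahoj", -2) == "oj")  // postgres: right('ahoj', -2) -> 'oj'
assert(pgLeft("ahoj", 2) == "ah" && pgRight("ahoj", 2) == "oj")
{code}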
[jira] [Assigned] (SPARK-28156) Join plan sometimes does not use cached query
[ https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28156: Assignee: (was: Apache Spark) > Join plan sometimes does not use cached query > - > > Key: SPARK-28156 > URL: https://issues.apache.org/jira/browse/SPARK-28156 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Bruce Robbins >Priority: Major > > I came across a case where a cached query is referenced on both sides of a > join, but the InMemoryRelation is inserted on only one side. This case occurs > only when the cached query uses a (Hive-style) view. > Consider this example: > {noformat} > // create the data > val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", > "c", "d") > df1.write.mode("overwrite").format("orc").saveAsTable("table1") > sql("drop view if exists table1_vw") > sql("create view table1_vw as select * from table1") > // create the cached query > val cacheddataDf = sql(""" > select a, b, c, d > from table1_vw > """) > import org.apache.spark.storage.StorageLevel.DISK_ONLY > cacheddataDf.createOrReplaceTempView("cacheddata") > cacheddataDf.persist(DISK_ONLY) > // main query > val queryDf = sql(s""" > select leftside.a, leftside.b > from cacheddata leftside > join cacheddata rightside > on leftside.a = rightside.a > """) > queryDf.explain(true) > {noformat} > Note that the optimized plan does not use an InMemoryRelation for the right > side, but instead just uses a Relation: > {noformat} > Project [a#45, b#46] > +- Join Inner, (a#45 = a#37) >:- Project [a#45, b#46] >: +- Filter isnotnull(a#45) >: +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 > replicas) >: +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] > Batched: true, DataFilters: [], Format: ORC, Location: > InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct >+- Project [a#37] > +- Filter isnotnull(a#37) > +- Relation[a#37,b#38,c#39,d#40] orc > {noformat} > The fragment does not match the cached query because AliasViewChild adds an > extra projection under the View on the right side (see #2 below). > AliasViewChild adds the extra projection because the exprIds in the View's > output appears to have been renamed by Analyzer$ResolveReferences (#1 below). > I have not yet looked at why. > {noformat} > - > - > - >+- SubqueryAlias `rightside` > +- SubqueryAlias `cacheddata` > +- Project [a#73, b#74, c#75, d#76] > +- SubqueryAlias `default`.`table1_vw` > (#1) ->+- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76]) > (#2) -> +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS > b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76] > +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) > AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48] > +- Project [a#37, b#38, c#39, d#40] >+- SubqueryAlias `default`.`table1` > +- Relation[a#37,b#38,c#39,d#40] orc > {noformat} > In a larger query (where cachedata may be referred on either side only > indirectly), this phenomenon can create certain oddities, as the fragment is > not replaced with InMemoryRelation, and the fragment is present when the plan > is optimized as a whole. > In Spark 2.1.3, Spark uses InMemoryRelation on both sides. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28156) Join plan sometimes does not use cached query
[ https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28156: Assignee: Apache Spark > Join plan sometimes does not use cached query > - > > Key: SPARK-28156 > URL: https://issues.apache.org/jira/browse/SPARK-28156 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Major > > I came across a case where a cached query is referenced on both sides of a > join, but the InMemoryRelation is inserted on only one side. This case occurs > only when the cached query uses a (Hive-style) view. > Consider this example: > {noformat} > // create the data > val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", > "c", "d") > df1.write.mode("overwrite").format("orc").saveAsTable("table1") > sql("drop view if exists table1_vw") > sql("create view table1_vw as select * from table1") > // create the cached query > val cacheddataDf = sql(""" > select a, b, c, d > from table1_vw > """) > import org.apache.spark.storage.StorageLevel.DISK_ONLY > cacheddataDf.createOrReplaceTempView("cacheddata") > cacheddataDf.persist(DISK_ONLY) > // main query > val queryDf = sql(s""" > select leftside.a, leftside.b > from cacheddata leftside > join cacheddata rightside > on leftside.a = rightside.a > """) > queryDf.explain(true) > {noformat} > Note that the optimized plan does not use an InMemoryRelation for the right > side, but instead just uses a Relation: > {noformat} > Project [a#45, b#46] > +- Join Inner, (a#45 = a#37) >:- Project [a#45, b#46] >: +- Filter isnotnull(a#45) >: +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 > replicas) >: +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] > Batched: true, DataFilters: [], Format: ORC, Location: > InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct >+- Project [a#37] > +- Filter isnotnull(a#37) > +- Relation[a#37,b#38,c#39,d#40] orc > {noformat} > The fragment does not match the cached query because AliasViewChild adds an > extra projection under the View on the right side (see #2 below). > AliasViewChild adds the extra projection because the exprIds in the View's > output appears to have been renamed by Analyzer$ResolveReferences (#1 below). > I have not yet looked at why. > {noformat} > - > - > - >+- SubqueryAlias `rightside` > +- SubqueryAlias `cacheddata` > +- Project [a#73, b#74, c#75, d#76] > +- SubqueryAlias `default`.`table1_vw` > (#1) ->+- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76]) > (#2) -> +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS > b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76] > +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) > AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48] > +- Project [a#37, b#38, c#39, d#40] >+- SubqueryAlias `default`.`table1` > +- Relation[a#37,b#38,c#39,d#40] orc > {noformat} > In a larger query (where cachedata may be referred on either side only > indirectly), this phenomenon can create certain oddities, as the fragment is > not replaced with InMemoryRelation, and the fragment is present when the plan > is optimized as a whole. > In Spark 2.1.3, Spark uses InMemoryRelation on both sides. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28157: Assignee: (was: Apache Spark) > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see this log files back after the permissions are > recovered correctly. The only way has been restarting SHS. > Apache Spark is unable to know the permission recovery. However, we had > better give a second chances for those blacklisted files in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28157: Assignee: Apache Spark > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see this log files back after the permissions are > recovered correctly. The only way has been restarting SHS. > Apache Spark is unable to know the permission recovery. However, we had > better give a second chances for those blacklisted files in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28157: -- Description: At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to the file system, and maintains a permanent blacklist for all event log files failed once at reading. Although this reduces a lot of invalid accesses, there is no way to see this log files back after the permissions are recovered correctly. The only way has been restarting SHS. Apache Spark is unable to know the permission recovery. However, we had better give a second chances for those blacklisted files in a regular manner. was: Since Spark 2.4.0, SPARK-24948 delegated access permission checks to the file system, and maintains a permanent blacklist for all event log files failed once at reading. Although this reduces a lot of invalid accesses, there is no way to see this log files back after the permissions are recovered correctly. The only way has been restarting SHS. Apache Spark is unable to know the permission recovery. However, we had better give a second chances for those blacklisted files in a regular manner. > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see this log files back after the permissions are > recovered correctly. The only way has been restarting SHS. > Apache Spark is unable to know the permission recovery. However, we had > better give a second chances for those blacklisted files in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28157: -- Affects Version/s: 2.4.0 2.4.1 2.4.2 2.4.3 > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > Since Spark 2.4.0, SPARK-24948 delegated access permission checks to the file > system, and maintains a permanent blacklist for all event log files failed > once at reading. Although this reduces a lot of invalid accesses, there is no > way to see this log files back after the permissions are recovered correctly. > The only way has been restarting SHS. > Apache Spark is unable to know the permission recovery. However, we had > better give a second chances for those blacklisted files in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28157) Make SHS check Spark event log file permission changes
Dongjoon Hyun created SPARK-28157: - Summary: Make SHS check Spark event log file permission changes Key: SPARK-28157 URL: https://issues.apache.org/jira/browse/SPARK-28157 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun Since Spark 2.4.0, SPARK-24948 has delegated access permission checks to the file system and maintains a permanent blacklist for all event log files that failed once at reading. Although this reduces a lot of invalid accesses, there is no way to see these log files again after the permissions are recovered correctly. The only way has been restarting the SHS. Apache Spark has no way to know about the permission recovery. However, we had better give those blacklisted files a second chance on a regular basis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
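One possible shape for the proposed "second chance" is to let blacklist entries expire after a configurable interval, so that the next regular scan retries the corresponding files. The sketch below is purely hypothetical; the class and method names do not correspond to the actual SHS code.
{code:scala}
// Hypothetical sketch of an expiring blacklist for event log files.
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

class EventLogBlacklist(expiryMs: Long) {
  private val blacklisted = new ConcurrentHashMap[String, Long]() // path -> time added

  def add(path: String): Unit = blacklisted.put(path, System.currentTimeMillis())

  def contains(path: String): Boolean = blacklisted.containsKey(path)

  // Called on each regular scan: entries older than expiryMs get a second chance.
  def expireOldEntries(): Unit = {
    val now = System.currentTimeMillis()
    blacklisted.asScala
      .collect { case (path, added) if now - added > expiryMs => path }
      .foreach(blacklisted.remove)
  }
}
{code}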
[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values
[ https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872015#comment-16872015 ] Tony Zhang commented on SPARK-28135: HIVE-21916 > ceil/ceiling/floor/power returns incorrect values > - > > Key: SPARK-28135 > URL: https://issues.apache.org/jira/browse/SPARK-28135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > spark-sql> select ceil(double(1.2345678901234e+200)), > ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), > power('1', 'NaN'); > 9223372036854775807 9223372036854775807 9223372036854775807 NaN > {noformat} > {noformat} > postgres=# select ceil(1.2345678901234e+200::float8), > ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), > power('1', 'NaN'); > ceil | ceiling|floor | power > --+--+--+--- > 1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1 > (1 row) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project
[ https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28153: Assignee: Apache Spark > input_file_name doesn't work with Python UDF in the same project > > > Key: SPARK-28153 > URL: https://issues.apache.org/jira/browse/SPARK-28153 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {code} > from pyspark.sql.functions import udf, input_file_name > spark.range(10).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), > input_file_name()).show() > {code} > {code} > ++-+ > |(id)|input_file_name()| > ++-+ > | 8| | > | 5| | > | 0| | > | 9| | > | 6| | > | 2| | > | 3| | > | 4| | > | 7| | > | 1| | > ++-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project
[ https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28153: Assignee: (was: Apache Spark) > input_file_name doesn't work with Python UDF in the same project > > > Key: SPARK-28153 > URL: https://issues.apache.org/jira/browse/SPARK-28153 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > from pyspark.sql.functions import udf, input_file_name > spark.range(10).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), > input_file_name()).show() > {code} > {code} > ++-+ > |(id)|input_file_name()| > ++-+ > | 8| | > | 5| | > | 0| | > | 9| | > | 6| | > | 2| | > | 3| | > | 4| | > | 7| | > | 1| | > ++-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871976#comment-16871976 ] Apache Spark commented on SPARK-27100: -- User 'parthchandra' has created a pull request for this issue: https://github.com/apache/spark/pull/24957 > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.3.3 >Reporter: KaiXu >Priority: Major > Attachments: SPARK-27100-Overflow.txt, stderr > > > ALS in Spark MLlib causes StackOverflow: > /opt/sparkml/spark213/bin/spark-submit --properties-file > /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class > com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client > --num-executors 3 --executor-memory 322g > /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 > --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 > --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input > > Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
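A common mitigation for this kind of StackOverflowError, caused by very long RDD lineage in iterative algorithms such as ALS, is to enable periodic checkpointing so the serialized lineage stays shallow. A workaround sketch using the spark.ml ALS API with the parameters from the report above (the checkpoint directory and interval are illustrative, and this is not necessarily what the linked pull request implements):
{code:scala}
// Workaround sketch: checkpoint every few iterations to truncate the lineage
// that the scheduler otherwise has to serialize recursively.
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ALSExample").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs://bdw-slave20:8020/tmp/als-checkpoint") // example path

val als = new ALS()
  .setRank(100)
  .setMaxIter(100)
  .setImplicitPrefs(true)
  .setRegParam(1.0)
  .setCheckpointInterval(10) // checkpoint every 10 iterations
{code}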
[jira] [Resolved] (SPARK-28108) Simplify OrcFilters
[ https://issues.apache.org/jira/browse/SPARK-28108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28108. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24910 [https://github.com/apache/spark/pull/24910] > Simplify OrcFilters > --- > > Key: SPARK-28108 > URL: https://issues.apache.org/jira/browse/SPARK-28108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.0.0 > > > In #24068, @IvanVergiliev reports that OrcFilters.createBuilder has > exponential complexity in the height of the filter tree due to the way the > check-and-build pattern is implemented. > This is because the same method createBuilder is called twice recursively for > any children under And/Or/Not nodes, so the second call happens inside the > first (see the description in #24068 for details). > Compared to the approach in #24068, I propose a very simple solution for the > issue. We can rely on the result of convertibleFilters, which can build a > fully convertible tree. With it, createBuilder no longer needs to worry about > whether the children of a certain node are convertible. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28108) Simplify OrcFilters
[ https://issues.apache.org/jira/browse/SPARK-28108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28108: --- Assignee: Gengliang Wang > Simplify OrcFilters > --- > > Key: SPARK-28108 > URL: https://issues.apache.org/jira/browse/SPARK-28108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > In #24068, @IvanVergiliev reports that OrcFilters.createBuilder has > exponential complexity in the height of the filter tree due to the way the > check-and-build pattern is implemented. > This is because the same method createBuilder is called twice recursively for > any children under And/Or/Not nodes, so the second call happens inside the > first (see the description in #24068 for details). > Compared to the approach in #24068, I propose a very simple solution for the > issue. We can rely on the result of convertibleFilters, which can build a > fully convertible tree. With it, createBuilder no longer needs to worry about > whether the children of a certain node are convertible. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
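To make the proposed approach concrete, here is a simplified, illustrative sketch of a convertibleFilters-style pre-pass over Spark's data source Filter trees: the predicate tree is first pruned to a fully convertible one, so the subsequent build step never needs to re-check children. This is pseudocode over the public Filter API with an assumed isConvertible predicate, not the actual OrcFilters implementation.
{code:scala}
// Prune a predicate tree down to its fully convertible part. `isConvertible`
// stands in for the leaf-level check that the real code performs.
import org.apache.spark.sql.sources.{And, Filter, Or}

def convertibleFilters(filters: Seq[Filter])(isConvertible: Filter => Boolean): Seq[Filter] = {
  def prune(filter: Filter): Option[Filter] = filter match {
    case And(left, right) =>
      // Either side of an And can be pushed down on its own.
      (prune(left), prune(right)) match {
        case (Some(l), Some(r)) => Some(And(l, r))
        case (Some(l), None)    => Some(l)
        case (None, Some(r))    => Some(r)
        case (None, None)       => None
      }
    case Or(left, right) =>
      // An Or is only kept if both sides survive pruning.
      for { l <- prune(left); r <- prune(right) } yield Or(l, r)
    case other =>
      if (isConvertible(other)) Some(other) else None
  }
  filters.flatMap(prune)
}
{code}
Building the ORC SearchArgument from the pruned tree can then assume every node is convertible, which removes the doubled recursive calls described above.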
[jira] [Commented] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile
[ https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871948#comment-16871948 ] Dongjoon Hyun commented on SPARK-28114: --- Although I'm a read-only Jenkins member, I added three new jobs (except the `compilation job`) to `Spark QA Test (Dashboard)`, [~shaneknapp]. > Add Jenkins job for `Hadoop-3.2` profile > > > Key: SPARK-28114 > URL: https://issues.apache.org/jira/browse/SPARK-28114 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: shane knapp >Priority: Major > > Spark 3.0 is a major version change. We want to have the following new Jobs. > 1. SBT with hadoop-3.2 > 2. Maven with hadoop-3.2 (on JDK8 and JDK11) > Also, shall we have a limit for the concurrent run for the following existing > job? Currently, it invokes multiple jobs concurrently. We can save the > resource by limiting to 1 like the other jobs. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing > We will drop four `branch-2.3` jobs at the end of August, 2019. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile
[ https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871944#comment-16871944 ] Dongjoon Hyun commented on SPARK-28114: --- Thank you so much, [~shaneknapp] and [~srowen]. The four new jobs work as expected. The job weirdly showed `--force` at that failure, as in the snippet below, but removing the `hadoop-1` profile seems to have fixed it. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6509/console {code} + [[ hadoop-2.7 == hadoop-1 ]] + build/mvn --force -DzincPort=3245 -DskipTests -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos clean package {code} > Add Jenkins job for `Hadoop-3.2` profile > > > Key: SPARK-28114 > URL: https://issues.apache.org/jira/browse/SPARK-28114 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: shane knapp >Priority: Major > > Spark 3.0 is a major version change. We want to have the following new Jobs. > 1. SBT with hadoop-3.2 > 2. Maven with hadoop-3.2 (on JDK8 and JDK11) > Also, shall we have a limit for the concurrent run for the following existing > job? Currently, it invokes multiple jobs concurrently. We can save the > resource by limiting to 1 like the other jobs. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing > We will drop four `branch-2.3` jobs at the end of August, 2019. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27815) do not leak SaveMode to file source v2
[ https://issues.apache.org/jira/browse/SPARK-27815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27815: Assignee: Apache Spark > do not leak SaveMode to file source v2 > -- > > Key: SPARK-27815 > URL: https://issues.apache.org/jira/browse/SPARK-27815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Blocker > > Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to > file source v2. This should be removed and file source v2 should not accept > SaveMode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27815) do not leak SaveMode to file source v2
[ https://issues.apache.org/jira/browse/SPARK-27815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27815: Assignee: (was: Apache Spark) > do not leak SaveMode to file source v2 > -- > > Key: SPARK-27815 > URL: https://issues.apache.org/jira/browse/SPARK-27815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to > file source v2. This should be removed and file source v2 should not accept > SaveMode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28156) Join plan sometimes does not use cached query
Bruce Robbins created SPARK-28156: - Summary: Join plan sometimes does not use cached query Key: SPARK-28156 URL: https://issues.apache.org/jira/browse/SPARK-28156 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 2.3.3, 3.0.0 Reporter: Bruce Robbins I came across a case where a cached query is referenced on both sides of a join, but the InMemoryRelation is inserted on only one side. This case occurs only when the cached query uses a (Hive-style) view. Consider this example: {noformat} // create the data val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", "c", "d") df1.write.mode("overwrite").format("orc").saveAsTable("table1") sql("drop view if exists table1_vw") sql("create view table1_vw as select * from table1") // create the cached query val cacheddataDf = sql(""" select a, b, c, d from table1_vw """) import org.apache.spark.storage.StorageLevel.DISK_ONLY cacheddataDf.createOrReplaceTempView("cacheddata") cacheddataDf.persist(DISK_ONLY) // main query val queryDf = sql(s""" select leftside.a, leftside.b from cacheddata leftside join cacheddata rightside on leftside.a = rightside.a """) queryDf.explain(true) {noformat} Note that the optimized plan does not use an InMemoryRelation for the right side, but instead just uses a Relation: {noformat} Project [a#45, b#46] +- Join Inner, (a#45 = a#37) :- Project [a#45, b#46] : +- Filter isnotnull(a#45) : +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 replicas) : +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- Project [a#37] +- Filter isnotnull(a#37) +- Relation[a#37,b#38,c#39,d#40] orc {noformat} The fragment does not match the cached query because AliasViewChild adds an extra projection under the View on the right side (see #2 below). AliasViewChild adds the extra projection because the exprIds in the View's output appears to have been renamed by Analyzer$ResolveReferences (#1 below). I have not yet looked at why. {noformat} - - - +- SubqueryAlias `rightside` +- SubqueryAlias `cacheddata` +- Project [a#73, b#74, c#75, d#76] +- SubqueryAlias `default`.`table1_vw` (#1) ->+- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76]) (#2) -> +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76] +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48] +- Project [a#37, b#38, c#39, d#40] +- SubqueryAlias `default`.`table1` +- Relation[a#37,b#38,c#39,d#40] orc {noformat} In a larger query (where cachedata may be referred on either side only indirectly), this phenomenon can create certain oddities, as the fragment is not replaced with InMemoryRelation, and the fragment is present when the plan is optimized as a whole. In Spark 2.1.3, Spark uses InMemoryRelation on both sides. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28155) Improve SQL optimizer's predicate pushdown performance for cascading joins
Yesheng Ma created SPARK-28155: -- Summary: Improve SQL optimizer's predicate pushdown performance for cascading joins Key: SPARK-28155 URL: https://issues.apache.org/jira/browse/SPARK-28155 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The current Catalyst optimizer's predicate pushdown is divided into two separate rules: PushDownPredicate and PushThroughJoin. This is not efficient for optimizing cascading joins such as TPC-DS q64, where the whole default batch is re-executed just because of this. We need a more efficient approach that pushes down predicates as much as possible in a single pass. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28154) GMM fix double caching
[ https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28154: Assignee: (was: Apache Spark) > GMM fix double caching > -- > > Key: SPARK-28154 > URL: https://issues.apache.org/jira/browse/SPARK-28154 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > The intermediate rdd is always cached. We should only cache it if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28154) GMM fix double caching
[ https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28154: Assignee: Apache Spark > GMM fix double caching > -- > > Key: SPARK-28154 > URL: https://issues.apache.org/jira/browse/SPARK-28154 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > The intermediate rdd is always cached. We should only cache it if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28154) GMM fix double caching
[ https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871909#comment-16871909 ] Apache Spark commented on SPARK-28154: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/24919 > GMM fix double caching > -- > > Key: SPARK-28154 > URL: https://issues.apache.org/jira/browse/SPARK-28154 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > The intermediate rdd is always cached. We should only cache it if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28154) GMM fix double caching
zhengruifeng created SPARK-28154: Summary: GMM fix double caching Key: SPARK-28154 URL: https://issues.apache.org/jira/browse/SPARK-28154 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.0, 2.3.0, 3.0.0 Reporter: zhengruifeng The intermediate rdd is always cached. We should only cache it if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
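In Spark ML the "cache only if necessary" idea usually takes the following shape: persist the intermediate RDD only when the input dataset is not already cached, and unpersist it when training is done. A minimal sketch of that pattern (an illustrative helper, not the exact change in the associated pull request):
{code:scala}
// Persist the intermediate RDD only if the source dataset is uncached.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

def withCachingIfNeeded[T, R](dataset: Dataset[_], instances: RDD[T])(train: RDD[T] => R): R = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) {
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try train(instances)
  finally if (handlePersistence) instances.unpersist()
}
{code}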
[jira] [Comment Edited] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871885#comment-16871885 ] Hyukjin Kwon edited comment on SPARK-27966 at 6/25/19 1:01 AM: --- While debugging this, I found one case not working: {code} from pyspark.sql.functions import udf, input_file_name spark.range(10).write.mode("overwrite").parquet("/tmp/foo") spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), input_file_name()).show() {code} {code} ++-+ |(id)|input_file_name()| ++-+ | 8| | | 5| | | 0| | | 9| | | 6| | | 2| | | 3| | | 4| | | 7| | | 1| | ++-+ {code} But the reproducer described here works. Is this same issue or different issue, [~Chr_96er]? was (Author: hyukjin.kwon): While debugging this, I found one case not working: {code} from pyspark.sql.functions import udf, input_file_name spark.range(10).write.mode("overwrite").parquet("/tmp/foo") spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), input_file_name()).show() {code} {code} ++-+ |(id)|input_file_name()| ++-+ | 8| | | 5| | | 0| | | 9| | | 6| | | 2| | | 3| | | 4| | | 7| | | 1| | ++-+ {code} But the reproducer described here works. Is this same issue or different issue, [~Chr_96er]? > input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. > > {code:java} > df.select(input_file_name()).show(5,false) > {code} > > {code:java} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} > My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when > {code:java} > 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 127; threshold: 32 > 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories > in parallel under:{code} > > Everything's fine as long as > {code:java} > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 6; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > {code} > > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. 
> I've setup a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. If you need more, please let me know.* > ** > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) -
[jira] [Created] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project
Hyukjin Kwon created SPARK-28153: Summary: input_file_name doesn't work with Python UDF in the same project Key: SPARK-28153 URL: https://issues.apache.org/jira/browse/SPARK-28153 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Reporter: Hyukjin Kwon {code} from pyspark.sql.functions import udf, input_file_name spark.range(10).write.mode("overwrite").parquet("/tmp/foo") spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), input_file_name()).show() {code} {code} ++-+ |(id)|input_file_name()| ++-+ | 8| | | 5| | | 0| | | 9| | | 6| | | 2| | | 3| | | 4| | | 7| | | 1| | ++-+ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871885#comment-16871885 ] Hyukjin Kwon commented on SPARK-27966: -- While debugging this, I found one case not working:
{code}
from pyspark.sql.functions import udf, input_file_name
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), input_file_name()).show()
{code}
{code}
+----+-----------------+
|(id)|input_file_name()|
+----+-----------------+
|   8|                 |
|   5|                 |
|   0|                 |
|   9|                 |
|   6|                 |
|   2|                 |
|   3|                 |
|   4|                 |
|   7|                 |
|   1|                 |
+----+-----------------+
{code}
But the reproducer described here works. Is this the same issue or a different one, [~Chr_96er]?

> input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. > > {code:java} > df.select(input_file_name()).show(5,false) > {code} > >
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under:{code}
> > Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> {code}
> > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. > I've set up a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. If you need more, please let me know.* > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26896) Add maven profiles for running tests with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-26896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26896. --- Resolution: Not A Problem > Add maven profiles for running tests with JDK 11 > > > Key: SPARK-26896 > URL: https://issues.apache.org/jira/browse/SPARK-26896 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > Running unit tests w/ JDK 11 trips over some issues w/ the new module system. > These can be worked around with the new {{--add-opens}} etc. commands. I > think we need to add a build profile for JDK 11 to add some extra args to the > test runners. > In particular: > 1) removal of jaxb from java itself (used in pmml export in mllib) > 2) Some reflective access which results in failures, eg. > {noformat} > Unable to make field jdk.internal.ref.PhantomCleanable > jdk.internal.ref.PhantomCleanable.prev accessible: module java.base does > not "opens jdk.internal.ref" to unnamed module > {noformat} > 3) Some reflective access which results in warnings (you can add > {{--illegal-access=warn}} to see all of these). > All I'm proposing we do here is put in the required handling to make these > problems go away, not necessarily do the "right" thing by no longer > referencing these unexposed internals. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871839#comment-16871839 ] Steve Loughran commented on SPARK-28091: The real source of metrics for S3A (and the Azure stores, Google cloud) is actually the per-instance StorageStatistics you can get from {{getStorageStatistics()}} on an instance; look at {{org.apache.hadoop.fs.s3a.Statistic}} for what is collected there...pretty much every operation has its own counter. It'd be interesting to see what your collection code does here. It'd also be interesting to think about whether people could extend it to get low-level stats from CPUs themselves. One issue I have found with trying to collect per-query stats is that the filesystem-instance counters can't be tied down to specific queries, as they aren't per-thread. No good solutions there, at least nothing under dev, though the Impala team have been asking for stuff (primarily collecting input stream stats on seek cost). > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics systems implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark job. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. Rather than extending the code > there, we proposes to add a metrics plugin system which is of more flexible > and general use. > Advantages: > * The metric plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop. Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been > introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an > easy way to “opt in” using such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc), that maybe would > not normally be considered general enough for inclusion in Apache Spark code, > but that can be nevertheless useful for specialized use cases, tests or > troubleshooting. 
> Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. It is based on the work on Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228 > Tests and examples: > This has been so far manually tested running Spark on YARN and K8S clusters, > in particular for monitoring S3 and for extending HDFS instrumentation with > the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric > plugin example and code used for testing are available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
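To make the per-instance StorageStatistics route from the comment above concrete, here is a small sketch; it is an illustration only (not code from the ticket, from the linked plugins, or from Spark itself), it assumes Hadoop 2.8+ with the S3A connector on the classpath, and "s3a://some-bucket/" is a placeholder URI.
{code:scala}
// Sketch only: dump every per-operation counter exposed by a filesystem instance.
import java.net.URI
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object DumpFsStats {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("s3a://some-bucket/"), new Configuration())
    // Each filesystem operation (open, rename, list, ...) has its own named long counter.
    fs.getStorageStatistics.getLongStatistics.asScala.foreach { stat =>
      println(s"${stat.getName} = ${stat.getValue}")
    }
  }
}
{code}
As the comment notes, these counters are per filesystem instance rather than per thread, so attributing them to individual queries is the open problem, not reading them.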
[jira] [Assigned] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth
[ https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28150: Assignee: Apache Spark > Failure to create multiple contexts in same JVM with Kerberos auth > -- > > Key: SPARK-28150 > URL: https://issues.apache.org/jira/browse/SPARK-28150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > Take the following small app that creates multiple contexts (not > concurrently): > {code} > from pyspark.context import SparkContext > import time > for i in range(2): > with SparkContext() as sc: > pass > time.sleep(5) > {code} > This fails when kerberos (without dt renewal) is being used: > {noformat} > 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) > Caused by: > org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error > calling method hbase.pb.AuthenticationService.GetAuthenticationToken > at > org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) > Caused by: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): > org.apache.hadoop.hbase.security.AccessDeniedException: Token generation > only allowed for Kerberos authenticated clients > at > org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126) > {noformat} > If you enable dt renewal things work since the codes takes a slightly > different path when generating the initial delegation tokens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth
[ https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28150: Assignee: (was: Apache Spark) > Failure to create multiple contexts in same JVM with Kerberos auth > -- > > Key: SPARK-28150 > URL: https://issues.apache.org/jira/browse/SPARK-28150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Take the following small app that creates multiple contexts (not > concurrently): > {code} > from pyspark.context import SparkContext > import time > for i in range(2): > with SparkContext() as sc: > pass > time.sleep(5) > {code} > This fails when kerberos (without dt renewal) is being used: > {noformat} > 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) > Caused by: > org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error > calling method hbase.pb.AuthenticationService.GetAuthenticationToken > at > org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) > Caused by: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): > org.apache.hadoop.hbase.security.AccessDeniedException: Token generation > only allowed for Kerberos authenticated clients > at > org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126) > {noformat} > If you enable dt renewal things work since the codes takes a slightly > different path when generating the initial delegation tokens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector
Shiv Prashant Sood created SPARK-28152: -- Summary: ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector Key: SPARK-28152 URL: https://issues.apache.org/jira/browse/SPARK-28152 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 3.0.0 Reporter: Shiv Prashant Sood ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables or Spark data frames being created with unintended types. Some example issues: * Writing from a DataFrame with a ShortType column results in a SQL table column of type INTEGER as opposed to SMALLINT, hence a larger table than expected. * Reading back results in a DataFrame column of type INTEGER as opposed to ShortType. FloatType has an issue in the read path. In the write path the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But in the read path, when JDBC data types need to be converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
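One way to picture the desired mappings is a custom JDBC dialect override; the sketch below is illustrative only (the object name and the canHandle URL prefix are assumptions, and this is not the patch that eventually addressed the issue). It writes ShortType as SMALLINT and reads REAL and SMALLINT back as FloatType and ShortType.
{code:scala}
// Sketch of a dialect that carries the mappings described in the report.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object ShortAndFloatFixDialect extends JdbcDialect {
  // Assumption: applied only to SQL Server style JDBC URLs for illustration.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Write path: use the narrower JDBC column definitions.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case FloatType => Some(JdbcType("REAL", Types.REAL))
    case _         => None
  }

  // Read path: REAL comes back as FloatType, SMALLINT as ShortType.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.REAL     => Some(FloatType)
      case Types.SMALLINT => Some(ShortType)
      case _              => None
    }
}
{code}
Registering such a dialect with JdbcDialects.registerDialect(ShortAndFloatFixDialect) before reading or writing would make Spark pick these mappings for matching URLs.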
[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24432: -- Assignee: (was: Marcelo Vanzin) > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24432: -- Assignee: Marcelo Vanzin > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Yinan Li >Assignee: Marcelo Vanzin >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16474) Global Aggregation doesn't seem to work at all
[ https://issues.apache.org/jira/browse/SPARK-16474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871798#comment-16871798 ] Josh Rosen commented on SPARK-16474: I just ran into this same issue. The problem here is that {{RelationalGroupedDataset}} only supports custom aggregate expressions which are defined over \{{Row}}s; you can see this from the code at [https://github.com/apache/spark/blob/67042e90e763f6a2716b942c3f887e94813889ce/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L225]. The solution here is to use the typed {{KeyValueGroupedDataset}} API via {{groupByKey()}}. Unfortunately there's not a great way to provide better error messages here: {{Aggregator}} doesn't require {{TypeTag}} or {{ClassTag}} for its input type, so we lack a reliable mechanism to detect and fail-fast when we're passed an aggregate over non-Rows. > Global Aggregation doesn't seem to work at all > --- > > Key: SPARK-16474 > URL: https://issues.apache.org/jira/browse/SPARK-16474 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Amit Sela >Priority: Major > > Executing a global aggregation (not grouped by key) fails. > Take the following code for example: > {code} > val session = SparkSession.builder() > .appName("TestGlobalAggregator") > .master("local[*]") > .getOrCreate() > import session.implicits._ > val ds1 = List(1, 2, 3).toDS > val ds2 = ds1.agg( > new Aggregator[Int, Int, Int]{ > def zero: Int = 0 > def reduce(b: Int, a: Int): Int = b + a > def merge(b1: Int, b2: Int): Int = b1 + b2 > def finish(reduction: Int): Int = reduction > def bufferEncoder: Encoder[Int] = implicitly[Encoder[Int]] > def outputEncoder: Encoder[Int] = implicitly[Encoder[Int]] > }.toColumn) > ds2.printSchema > ds2.show > {code} > I would expect the result to be 6, but instead I get the following exception: > {noformat} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast > to java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) > . > {noformat} > Trying the same code on DataFrames in 1.6.2 results in: > {noformat} > Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved > operator 'Aggregate [(anon$1(),mode=Complete,isDistinct=false) AS > anon$1()#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > .. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
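A short sketch of the typed route described in the comment above, using the same Aggregator shape as the report but applied through groupByKey(); a constant key stands in for a global aggregate, so the expected output is a single row (0, 6). This is only an illustration of the suggested API, not code from the ticket.
{code:scala}
// Sketch: global aggregation via the typed KeyValueGroupedDataset API.
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

object SumAgg extends Aggregator[Int, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Int): Int = b + a
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(reduction: Int): Int = reduction
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

object GlobalAggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("GlobalAggExample").getOrCreate()
    import spark.implicits._

    val ds = List(1, 2, 3).toDS()
    // KeyValueGroupedDataset.agg accepts the TypedColumn produced by toColumn,
    // so this yields (0, 6) instead of the ClassCastException seen with ds.agg(...).
    ds.groupByKey(_ => 0).agg(SumAgg.toColumn).show()

    spark.stop()
  }
}
{code}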
[jira] [Created] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables
Shiv Prashant Sood created SPARK-28151: -- Summary: ByteType is not correctly mapped for read/write of SQLServer tables Key: SPARK-28151 URL: https://issues.apache.org/jira/browse/SPARK-28151 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 3.0.0 Reporter: Shiv Prashant Sood Writing a dataframe with a ByteType column fails when using the JDBC connector for SQL Server. Append and read of tables also fail. The problem is due to: 1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in jdbcutils.scala, where ByteType gets mapped to the text "BYTE": case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT)) It should be mapped to TINYINT. In getCatalystType() (the JDBC to Catalyst type mapping), TINYINT is mapped to INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from the point of view of upcasting, but will lead to a 4-byte allocation rather than 1 byte for ByteType. 2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in the RDD row per the data type. Here there is no mapping for ByteType, and thus reads will result in an error once getCatalystType() is fixed. Note: these issues were found when reading/writing with SQLServer. Will be submitting a PR soon to fix these mappings in MsSqlServerDialect. Error seen when writing a table (JDBC write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: *Cannot find data type BYTE*.): com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: Cannot find data type BYTE. com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254) com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608) com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859) .. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
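The same dialect-override shape sketched under SPARK-28152 applies here; the snippet below is an assumed illustration of the TINYINT mapping in both directions, not the actual MsSqlServerDialect change promised in the ticket.
{code:scala}
// Sketch only: write ByteType as TINYINT and read TINYINT back as ByteType.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

object ByteTypeFixDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("TINYINT", Types.TINYINT)) // not the unknown "BYTE" type
    case _        => None
  }

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.TINYINT) Some(ByteType) else None // the read path still needs a ByteType getter
}
{code}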
[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms
[ https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871677#comment-16871677 ] Ruben Berenguel commented on SPARK-25994: - Hi [~mju] sounds good to me (sorry for the delay, conference season). I assume this next work is in SPARK-27303 and the corresponding PR? I'm not sure how much time I can give it in the coming couple of weeks, but I'll try to stay on top of the API design as much as I can > SPIP: Property Graphs, Cypher Queries, and Algorithms > - > > Key: SPARK-25994 > URL: https://issues.apache.org/jira/browse/SPARK-25994 > Project: Spark > Issue Type: Epic > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Martin Junghanns >Priority: Major > Labels: SPIP > > Copied from the SPIP doc: > {quote} > GraphX was one of the foundational pillars of the Spark project, and is the > current graph component. This reflects the importance of the graphs data > model, which naturally pairs with an important class of analytic function, > the network or graph algorithm. > However, GraphX is not actively maintained. It is based on RDDs, and cannot > exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala > users. > GraphFrames is a Spark package, which implements DataFrame-based graph > algorithms, and also incorporates simple graph pattern matching with fixed > length patterns (called “motifs”). GraphFrames is based on DataFrames, but > has a semantically weak graph data model (based on untyped edges and > vertices). The motif pattern matching facility is very limited by comparison > with the well-established Cypher language. > The Property Graph data model has become quite widespread in recent years, > and is the primary focus of commercial graph data management and of graph > data research, both for on-premises and cloud data management. Many users of > transactional graph databases also wish to work with immutable graphs in > Spark. > The idea is to define a Cypher-compatible Property Graph type based on > DataFrames; to replace GraphFrames querying with Cypher; to reimplement > GraphX/GraphFrames algos on the PropertyGraph type. > To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), > reusing existing proven designs and code, will be employed in Spark 3.0. This > graph query processor, like CAPS, will overlay and drive the SparkSQL > Catalyst query engine, using the CAPS graph query planner. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28140: Assignee: (was: Apache Spark) > Pyspark API to create spark.mllib RowMatrix from DataFrame > -- > > Key: SPARK-28140 > URL: https://issues.apache.org/jira/browse/SPARK-28140 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Henry Davidge >Priority: Major > > Since many functions are only implemented in spark.mllib, it is often > necessary to convert DataFrames of spark.ml vectors to spark.mllib > distributed matrix formats. The first step, converting the spark.ml vectors > to the spark.mllib equivalent, is straightforward. However, to the best of my > knowledge it's not possible to convert the resulting DataFrame to a RowMatrix > without using a python lambda function, which can have a significant > performance hit. In my recent use case, SVD took 3.5m using the Scala API, > but 12m using Python. > To get around this performance hit, I propose adding a constructor to the > Pyspark RowMatrix class that accepts a DataFrame with a single column of > spark.mllib vectors. I'd be happy to add an equivalent API for > IndexedRowMatrix if there is demand. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
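For reference, a sketch of what the Scala-side path described above looks like today (the route reported as taking ~3.5m); the column name "features" and the helper method name are assumptions for illustration, and this is not the proposed PySpark API.
{code:scala}
// Sketch: DataFrame of spark.ml vectors -> spark.mllib RowMatrix, no Python lambda involved.
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.DataFrame

object ToRowMatrix {
  // Assumes `df` has a single spark.ml vector column named "features".
  def toRowMatrix(df: DataFrame): RowMatrix = {
    val converted = MLUtils.convertVectorColumnsFromML(df, "features")
    new RowMatrix(converted.rdd.map(_.getAs[OldVector]("features")))
  }
}
{code}
A RowMatrix built this way can then run, for example, computeSVD entirely on the JVM side, which is the performance gap the proposed constructor aims to close for PySpark users.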
[jira] [Assigned] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28140: Assignee: Apache Spark > Pyspark API to create spark.mllib RowMatrix from DataFrame > -- > > Key: SPARK-28140 > URL: https://issues.apache.org/jira/browse/SPARK-28140 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Henry Davidge >Assignee: Apache Spark >Priority: Major > > Since many functions are only implemented in spark.mllib, it is often > necessary to convert DataFrames of spark.ml vectors to spark.mllib > distributed matrix formats. The first step, converting the spark.ml vectors > to the spark.mllib equivalent, is straightforward. However, to the best of my > knowledge it's not possible to convert the resulting DataFrame to a RowMatrix > without using a python lambda function, which can have a significant > performance hit. In my recent use case, SVD took 3.5m using the Scala API, > but 12m using Python. > To get around this performance hit, I propose adding a constructor to the > Pyspark RowMatrix class that accepts a DataFrame with a single column of > spark.mllib vectors. I'd be happy to add an equivalent API for > IndexedRowMatrix if there is demand. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth
[ https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-28150: --- Description: Take the following small app that creates multiple contexts (not concurrently): {code} from pyspark.context import SparkContext import time for i in range(2): with SparkContext() as sc: pass time.sleep(5) {code} This fails when kerberos (without dt renewal) is being used: {noformat} 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) Caused by: org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error calling method hbase.pb.AuthenticationService.GetAuthenticationToken at org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126) {noformat} If you enable dt renewal things work since the codes takes a slightly different path when generating the initial delegation tokens. was: Take the following small app that creates multiple contexts (not concurrently): {code} from pyspark.context import SparkContext import time for i in range(2): with SparkContext() as sc: pass time.sleep(5) {code} This fails when kerberos (without dt renewal) is being used: {noformat} 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) Caused by: org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error calling method hbase.pb.AuthenticationService.GetAuthenticationToken at org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) {noformat} If you enable dt renewal things work since the codes takes a slightly different path when generating the initial delegation tokens. 
> Failure to create multiple contexts in same JVM with Kerberos auth > -- > > Key: SPARK-28150 > URL: https://issues.apache.org/jira/browse/SPARK-28150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Take the following small app that creates multiple contexts (not > concurrently): > {code} > from pyspark.context import SparkContext > import time > for i in range(2): > with SparkContext() as sc: > pass > time.sleep(5) > {code} > This fails when kerberos (without dt renewal) is being used: > {noformat} > 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) > Caused by: > org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error > calling method hbase.pb.AuthenticationService.GetAuthenticationToken > at > org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) > Caused by: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): > org.apache.hadoop.hbase.security.AccessDeniedExcep
[jira] [Created] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth
Marcelo Vanzin created SPARK-28150: -- Summary: Failure to create multiple contexts in same JVM with Kerberos auth Key: SPARK-28150 URL: https://issues.apache.org/jira/browse/SPARK-28150 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Vanzin Take the following small app that creates multiple contexts (not concurrently): {code} from pyspark.context import SparkContext import time for i in range(2): with SparkContext() as sc: pass time.sleep(5) {code} This fails when kerberos (without dt renewal) is being used: {noformat} 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) Caused by: org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error calling method hbase.pb.AuthenticationService.GetAuthenticationToken at org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) {noformat} If you enable dt renewal things work since the codes takes a slightly different path when generating the initial delegation tokens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile
[ https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-28114. - Resolution: Fixed going to mark this as resolved. builds are green (WRT hadoop-3.2). > Add Jenkins job for `Hadoop-3.2` profile > > > Key: SPARK-28114 > URL: https://issues.apache.org/jira/browse/SPARK-28114 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: shane knapp >Priority: Major > > Spark 3.0 is a major version change. We want to have the following new Jobs. > 1. SBT with hadoop-3.2 > 2. Maven with hadoop-3.2 (on JDK8 and JDK11) > Also, shall we have a limit for the concurrent run for the following existing > job? Currently, it invokes multiple jobs concurrently. We can save the > resource by limiting to 1 like the other jobs. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing > We will drop four `branch-2.3` jobs at the end of August, 2019. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-28003. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24844 [https://github.com/apache/spark/pull/24844] > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 2.4.3 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > Fix For: 3.0.0 > > > {code:java} > import pandas as pd > dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100 > pdf1 = pd.DataFrame({'time': dt1}) > df1 = self.spark.createDataFrame(pdf1) > {code} > The example above doesn't work with arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-28003: Assignee: Li Jin > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 2.4.3 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > > {code:java} > import pandas as pd > dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100 > pdf1 = pd.DataFrame({'time': dt1}) > df1 = self.spark.createDataFrame(pdf1) > {code} > The example above doesn't work with arrow enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
[ https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27992: - Description: Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job are not propagated to Python. This is because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for these was to catch a SparkException if the job errored and then send the exception through the pyspark serializer. A better fix would be to allow Python to await on the serving thread future and join the thread. That way if the serving thread throws an exception, it will be propagated on the call to awaitResult. was: Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job are not propagated to Python. This is because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for these was to catch a SparkException if the job errored and then send the exception through the pyspark serializer. A better fix would be to allow Python to synchronize on the serving thread future. That way if the serving thread throws an exception, it will be propagated on the synchronization call. > PySpark socket server should sync with JVM connection thread future > --- > > Key: SPARK-27992 > URL: https://issues.apache.org/jira/browse/SPARK-27992 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark > job are not propagated to Python. This is because toLocalIterator() and > toPandas() with Arrow enabled run Spark jobs asynchronously in a background > thread, after creating the socket connection info. The fix for these was to > catch a SparkException if the job errored and then send the exception through > the pyspark serializer. > A better fix would be to allow Python to await on the serving thread future > and join the thread. That way if the serving thread throws an exception, it > will be propagated on the call to awaitResult. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28149) Disable negative DNS caching
[ https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Luis Pedrosa updated SPARK-28149: -- Description: By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. The proposed implementation is to change entrypoint.sh (which is exclusive to k8s) to alter the file that controls DNS caching, and disable negative caching if an environment variable named "DISABLE_DNS_NEGATIVE_CACHING" is defined. was: By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. The proposed implementation is to change entrypoint.sh (which is exclusive to k8s) to alter the file that controls DNS caching, and disable negative caching if an environment variable named "DISABLE_DNS_NEGATIVE_CAHING" is defined. > Disable negative DNS caching > - > > Key: SPARK-28149 > URL: https://issues.apache.org/jira/browse/SPARK-28149 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Priority: Minor > > By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. > The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. > This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. > > The proposed implementation is to change entrypoint.sh (which is exclusive to k8s) to alter the file that controls DNS caching, and disable negative caching if an environment variable named "DISABLE_DNS_NEGATIVE_CACHING" is defined. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
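For context, the JVM knob involved is the networkaddress.cache.negative.ttl security property (default 10 seconds); the sketch below only illustrates that property set programmatically and is not the entrypoint.sh change proposed in the ticket.
{code:scala}
// Sketch: disable negative DNS caching for the current JVM, before any lookup happens.
import java.security.Security

object DisableNegativeDnsCache {
  def main(args: Array[String]): Unit = {
    // 0 means "do not cache failed lookups"; the default java.security value is 10 seconds.
    Security.setProperty("networkaddress.cache.negative.ttl", "0")
    // ... start the connection attempts to the driver after this point ...
  }
}
{code}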
[jira] [Updated] (SPARK-28149) Disable negative DNS caching
[ https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Luis Pedrosa updated SPARK-28149: -- Description: By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. The proposed implementation is to change entrypoint.sh (which is exclusive to k8s) to alter the file that controls DNS caching, and disable negative caching if an environment variable named "DISABLE_DNS_NEGATIVE_CAHING" is defined. was: By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. > Disable negative DNS caching > - > > Key: SPARK-28149 > URL: https://issues.apache.org/jira/browse/SPARK-28149 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Priority: Minor > > By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. > The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. > This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. > > The proposed implementation is to change entrypoint.sh (which is exclusive to k8s) to alter the file that controls DNS caching, and disable negative caching if an environment variable named "DISABLE_DNS_NEGATIVE_CAHING" is defined. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28149) Disable negative DNS caching
[ https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Luis Pedrosa updated SPARK-28149: -- Summary: Disable negative DNS caching (was: Disable negative DNS caching.) > Disable negative DNS caching > - > > Key: SPARK-28149 > URL: https://issues.apache.org/jira/browse/SPARK-28149 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Priority: Minor > > By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. > The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. > This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28149) Disable negative DNS caching.
Jose Luis Pedrosa created SPARK-28149: - Summary: Disable negative DNS caching. Key: SPARK-28149 URL: https://issues.apache.org/jira/browse/SPARK-28149 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.3 Reporter: Jose Luis Pedrosa By default the JVM caches failed DNS resolutions; the failure is cached for 10 seconds. The Alpine JDK used in the Kubernetes images has a default timeout of 5 seconds. This means that in clusters with slow init time (network sidecar pods, slow network start-up) executors will never run, because the first attempt to connect to the driver will fail and that failure will be cached, causing the retries to happen in a tight loop without actually trying again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28148) repartition after join is not optimized away
colin fang created SPARK-28148: -- Summary: repartition after join is not optimized away Key: SPARK-28148 URL: https://issues.apache.org/jira/browse/SPARK-28148 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: colin fang Partitioning & sorting is usually retained after a join.
{code}
spark.conf.set('spark.sql.shuffle.partitions', '42')
df1 = spark.range(500, numPartitions=5)
df2 = spark.range(1000, numPartitions=5)
df3 = spark.range(2000, numPartitions=5)

# Reuse previous partitions & sort.
df1.join(df2, on='id').join(df3, on='id').explain()
# == Physical Plan ==
# *(8) Project [id#367L]
# +- *(8) SortMergeJoin [id#367L], [id#374L], Inner
#    :- *(5) Project [id#367L]
#    :  +- *(5) SortMergeJoin [id#367L], [id#369L], Inner
#    :     :- *(2) Sort [id#367L ASC NULLS FIRST], false, 0
#    :     :  +- Exchange hashpartitioning(id#367L, 42)
#    :     :     +- *(1) Range (0, 500, step=1, splits=5)
#    :     +- *(4) Sort [id#369L ASC NULLS FIRST], false, 0
#    :        +- Exchange hashpartitioning(id#369L, 42)
#    :           +- *(3) Range (0, 1000, step=1, splits=5)
#    +- *(7) Sort [id#374L ASC NULLS FIRST], false, 0
#       +- Exchange hashpartitioning(id#374L, 42)
#          +- *(6) Range (0, 2000, step=1, splits=5)
{code}
However, here partitions persist through a left join but the sort doesn't:
{code}
df1.join(df2, on='id', how='left').repartition('id').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(5) Sort [id#367L ASC NULLS FIRST], false, 0
# +- *(5) Project [id#367L]
#    +- SortMergeJoin [id#367L], [id#369L], LeftOuter
#       :- *(2) Sort [id#367L ASC NULLS FIRST], false, 0
#       :  +- Exchange hashpartitioning(id#367L, 42)
#       :     +- *(1) Range (0, 500, step=1, splits=5)
#       +- *(4) Sort [id#369L ASC NULLS FIRST], false, 0
#          +- Exchange hashpartitioning(id#369L, 42)
#             +- *(3) Range (0, 1000, step=1, splits=5)
{code}
Also here, partitions do not persist through an inner join:
{code}
df1.join(df2, on='id').repartition('id').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(6) Sort [id#367L ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(id#367L, 42)
#    +- *(5) Project [id#367L]
#       +- *(5) SortMergeJoin [id#367L], [id#369L], Inner
#          :- *(2) Sort [id#367L ASC NULLS FIRST], false, 0
#          :  +- Exchange hashpartitioning(id#367L, 42)
#          :     +- *(1) Range (0, 500, step=1, splits=5)
#          +- *(4) Sort [id#369L ASC NULLS FIRST], false, 0
#             +- Exchange hashpartitioning(id#369L, 42)
#                +- *(3) Range (0, 1000, step=1, splits=5)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
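A small sketch (not from the ticket) of one way to check the underlying claim directly: inspect the physical plan's output partitioning after the join, which shows whether a following repartition('id') would add another Exchange. The shuffle-partition setting mirrors the report; the broadcast threshold is disabled here only to force the sort-merge join shown in the plans above.
{code:scala}
// Sketch: print the join's output partitioning.
import org.apache.spark.sql.SparkSession

object CheckJoinPartitioning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("CheckJoinPartitioning").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "42")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // keep the sort-merge join

    val joined = spark.range(500).join(spark.range(1000), "id")
    // For the sort-merge join this reports hash partitioning on the join key,
    // so repartitioning on "id" afterwards should ideally be a no-op.
    println(joined.queryExecution.executedPlan.outputPartitioning)

    spark.stop()
  }
}
{code}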
[jira] [Commented] (SPARK-28146) Support IS OF () predicate
[ https://issues.apache.org/jira/browse/SPARK-28146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871474#comment-16871474 ] Yuming Wang commented on SPARK-28146: - {{IS OF}} does not conform to the ISO SQL behavior, so it is undocumented: https://github.com/postgres/postgres/blob/REL_12_BETA1/doc/src/sgml/func.sgml#L518-L535 > Support IS OF () predicate > > > Key: SPARK-28146 > URL: https://issues.apache.org/jira/browse/SPARK-28146 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Minor > > PostgreSQL supports IS OF () predicate, for example the following query > is valid: > {noformat} > select 1 is of (int), true is of (bool) > true true > {noformat} > I can't find PostgreSQL documentation about it, but here is how it works in > Oracle: > > [https://docs.oracle.com/cd/B28359_01/server.111/b28286/conditions014.htm#SQLRF52157] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28147) Support RETURNING clause
[ https://issues.apache.org/jira/browse/SPARK-28147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-28147: --- Description: PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to return data from the modified rows. [https://www.postgresql.org/docs/9.5/dml-returning.html] was: PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to return the modified rows. [https://www.postgresql.org/docs/9.5/dml-returning.html] > Support RETURNING clause > --- > > Key: SPARK-28147 > URL: https://issues.apache.org/jira/browse/SPARK-28147 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to > return data from the modified rows. > [https://www.postgresql.org/docs/9.5/dml-returning.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28147) Support RETURNING clause
Peter Toth created SPARK-28147: -- Summary: Support RETURNING clause Key: SPARK-28147 URL: https://issues.apache.org/jira/browse/SPARK-28147 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to return the modified rows. [https://www.postgresql.org/docs/9.5/dml-returning.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28146) Support IS OF () predicate
Peter Toth created SPARK-28146: -- Summary: Support IS OF () predicate Key: SPARK-28146 URL: https://issues.apache.org/jira/browse/SPARK-28146 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth PostgreSQL supports IS OF () predicate, for example the following query is valid: {noformat} select 1 is of (int), true is of (bool) true true {noformat} I can't find PostgreSQL documentation about it, but here is how it works in Oracle: [https://docs.oracle.com/cd/B28359_01/server.111/b28286/conditions014.htm#SQLRF52157] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28145) Executor pods polling source can fail to replace dead executors
[ https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28145: Assignee: (was: Apache Spark) > Executor pods polling source can fail to replace dead executors > --- > > Key: SPARK-28145 > URL: https://issues.apache.org/jira/browse/SPARK-28145 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Onur Satici >Priority: Major > > Scheduled task responsible for reporting executor snapshots to the executor > allocator in kubernetes will die on any error, killing subsequent runs of the > same task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28145) Executor pods polling source can fail to replace dead executors
[ https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28145: Assignee: Apache Spark > Executor pods polling source can fail to replace dead executors > --- > > Key: SPARK-28145 > URL: https://issues.apache.org/jira/browse/SPARK-28145 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Onur Satici >Assignee: Apache Spark >Priority: Major > > Scheduled task responsible for reporting executor snapshots to the executor > allocator in kubernetes will die on any error, killing subsequent runs of the > same task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28145) Executor pods polling source can fail to replace dead executors
Onur Satici created SPARK-28145: --- Summary: Executor pods polling source can fail to replace dead executors Key: SPARK-28145 URL: https://issues.apache.org/jira/browse/SPARK-28145 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.3, 3.0.0 Reporter: Onur Satici Scheduled task responsible for reporting executor snapshots to the executor allocator in kubernetes will die on any error, killing subsequent runs of the same task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
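The failure mode described in SPARK-28145 matches the standard java.util.concurrent behaviour: if a periodic task scheduled with scheduleAtFixedRate throws, subsequent executions are suppressed. The sketch below is a minimal Scala illustration of the general mitigation (catching at the top of the task body) under that assumption; pollExecutorPods is a hypothetical stand-in, not the actual Spark Kubernetes code or the eventual fix.
{code}
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Random

object PollingGuardSketch {
  // Stand-in for the real snapshot-reporting call; fails randomly to illustrate.
  private def pollExecutorPods(): Unit =
    if (Random.nextBoolean()) throw new RuntimeException("transient Kubernetes API error")

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()

    val task = new Runnable {
      override def run(): Unit = {
        // Without this try/catch a single failure would cancel the periodic task,
        // because ScheduledThreadPoolExecutor suppresses subsequent runs after an
        // uncaught exception.
        try {
          pollExecutorPods()
          println("snapshot reported")
        } catch {
          case e: Exception =>
            System.err.println(s"polling failed, will retry on next tick: ${e.getMessage}")
        }
      }
    }

    scheduler.scheduleAtFixedRate(task, 0, 1, TimeUnit.SECONDS)
    Thread.sleep(5000) // let a few ticks run for the demo
    scheduler.shutdown()
  }
}
{code}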
[jira] [Assigned] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27018: - Assignee: zhengruifeng > Checkpointed RDD deleted prematurely when using GBTClassifier > - > > Key: SPARK-27018 > URL: https://issues.apache.org/jira/browse/SPARK-27018 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0 > Environment: OS: Ubuntu Linux 18.10 > Java: java version "1.8.0_201" > Java(TM) SE Runtime Environment (build 1.8.0_201-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode) > Reproducible with a single-node Spark in standalone mode. > Reproducible with Zepellin or Spark shell. > >Reporter: Piotr Kołaczkowski >Assignee: zhengruifeng >Priority: Major > Attachments: > Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch > > > Steps to reproduce: > {noformat} > import org.apache.spark.ml.linalg.Vectors > import org.apache.spark.ml.classification.GBTClassifier > case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int) > sc.setCheckpointDir("/checkpoints") > val trainingData = sc.parallelize(1 to 2426874, 256).map(x => > Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF > val classifier = new GBTClassifier() > .setLabelCol("label") > .setFeaturesCol("features") > .setProbabilityCol("probability") > .setMaxIter(100) > .setMaxDepth(10) > .setCheckpointInterval(2) > classifier.fit(trainingData){noformat} > > The last line fails with: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 > (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: > /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51 > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39) > at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd
[jira] [Resolved] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27018. --- Resolution: Fixed Fix Version/s: 2.4.4 2.3.4 3.0.0 Issue resolved by pull request 24870 [https://github.com/apache/spark/pull/24870] > Checkpointed RDD deleted prematurely when using GBTClassifier > - > > Key: SPARK-27018 > URL: https://issues.apache.org/jira/browse/SPARK-27018 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0 > Environment: OS: Ubuntu Linux 18.10 > Java: java version "1.8.0_201" > Java(TM) SE Runtime Environment (build 1.8.0_201-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode) > Reproducible with a single-node Spark in standalone mode. > Reproducible with Zepellin or Spark shell. > >Reporter: Piotr Kołaczkowski >Assignee: zhengruifeng >Priority: Major > Fix For: 3.0.0, 2.3.4, 2.4.4 > > Attachments: > Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch > > > Steps to reproduce: > {noformat} > import org.apache.spark.ml.linalg.Vectors > import org.apache.spark.ml.classification.GBTClassifier > case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int) > sc.setCheckpointDir("/checkpoints") > val trainingData = sc.parallelize(1 to 2426874, 256).map(x => > Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF > val classifier = new GBTClassifier() > .setLabelCol("label") > .setFeaturesCol("features") > .setProbabilityCol("probability") > .setMaxIter(100) > .setMaxDepth(10) > .setCheckpointInterval(2) > classifier.fit(trainingData){noformat} > > The last line fails with: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 > (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: > /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51 > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39) > at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
[jira] [Assigned] (SPARK-27989) Add retries on the connection to the driver
[ https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27989: - Assignee: Jose Luis Pedrosa > Add retries on the connection to the driver > --- > > Key: SPARK-27989 > URL: https://issues.apache.org/jira/browse/SPARK-27989 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Assignee: Jose Luis Pedrosa >Priority: Minor > > > Any failure in the executor when trying to connect to the driver makes a > connection from that process impossible, which triggers the scheduling of a > replacement executor. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27989) Add retries on the connection to the driver
[ https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27989. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24702 [https://github.com/apache/spark/pull/24702] > Add retries on the connection to the driver > --- > > Key: SPARK-27989 > URL: https://issues.apache.org/jira/browse/SPARK-27989 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Assignee: Jose Luis Pedrosa >Priority: Minor > Fix For: 3.0.0 > > > > Any failure in the executor when trying to connect to the driver makes a > connection from that process impossible, which triggers the scheduling of a > replacement executor. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
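The actual change for SPARK-27989 is in the linked pull request; as a rough Scala illustration of the retry idea described in the ticket (not the real implementation, and with a placeholder connectToDriver helper), it might look like this:
{code}
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Try `op` up to `maxAttempts` times, sleeping `waitMs` between attempts,
  // and rethrow the last failure if every attempt fails.
  def retry[T](maxAttempts: Int, waitMs: Long)(op: => T): T =
    Try(op) match {
      case Success(value) => value
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(waitMs)
        retry(maxAttempts - 1, waitMs)(op)
      case Failure(e) => throw e
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical usage: keep trying to reach the driver instead of failing fast.
    val status = retry(maxAttempts = 5, waitMs = 2000) {
      connectToDriver("driver-host", 7078) // placeholder for the real RPC setup
    }
    println(status)
  }

  // Stand-in for the real connection logic; only here to make the sketch runnable.
  private def connectToDriver(host: String, port: Int): String = s"connected to $host:$port"
}
{code}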
[jira] [Commented] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871213#comment-16871213 ] Jiatao Tao commented on SPARK-27546: Hello [~dongjoon]. Anyone? > Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
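For context on what the SPARK-27546 title refers to: DateTimeUtils#defaultTimeZone is based on the JVM default time zone, while the session-local time zone is configured per SparkSession via spark.sql.session.timeZone. A minimal Scala sketch of the difference follows; it is an illustration only, not the proposed patch.
{code}
import java.util.TimeZone
import org.apache.spark.sql.SparkSession

object SessionTimeZoneSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("session-timezone-sketch")
      .config("spark.sql.session.timeZone", "UTC") // session-local time zone
      .getOrCreate()

    // What DateTimeUtils#defaultTimeZone is based on: the JVM default.
    println(s"JVM default time zone:   ${TimeZone.getDefault.getID}")

    // What the SQL layer should consult instead: the session-local setting.
    val sessionTz = spark.conf.get("spark.sql.session.timeZone")
    println(s"Session-local time zone: $sessionTz")

    spark.stop()
  }
}
{code}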
[jira] [Resolved] (SPARK-28142) KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT
[ https://issues.apache.org/jira/browse/SPARK-28142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28142. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24942 [https://github.com/apache/spark/pull/24942] > KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT > - > > Key: SPARK-28142 > URL: https://issues.apache.org/jira/browse/SPARK-28142 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > KafkaContinuousStream has a bug where {{pollTimeoutMs}} is always set to the > default value: the lookup key is {{KafkaSourceProvider.CONSUMER_POLL_TIMEOUT}}, > i.e. {{kafkaConsumer.pollTimeoutMs}}, but the map provided as {{sourceOptions}} > has lower-cased keys, so the lookup never matches. > The missing piece is that the map should be passed as a > CaseInsensitiveStringMap, as was done in SPARK-27106. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28142) KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT
[ https://issues.apache.org/jira/browse/SPARK-28142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28142: Assignee: Jungtaek Lim > KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT > - > > Key: SPARK-28142 > URL: https://issues.apache.org/jira/browse/SPARK-28142 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > KafkaContinuousStream has a bug where {{pollTimeoutMs}} is always set to the > default value: the lookup key is {{KafkaSourceProvider.CONSUMER_POLL_TIMEOUT}}, > i.e. {{kafkaConsumer.pollTimeoutMs}}, but the map provided as {{sourceOptions}} > has lower-cased keys, so the lookup never matches. > The missing piece is that the map should be passed as a > CaseInsensitiveStringMap, as was done in SPARK-27106. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
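A small Scala sketch of the key-casing mismatch described in SPARK-28142 (illustrative values only, not the fix in the linked PR): a camel-cased constant misses in a lower-cased plain map, while a CaseInsensitiveStringMap resolves it regardless of case.
{code}
import java.util.{HashMap => JHashMap}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

object PollTimeoutLookupSketch {
  def main(args: Array[String]): Unit = {
    // Options as the continuous source receives them: keys already lower-cased.
    val raw = new JHashMap[String, String]()
    raw.put("kafkaconsumer.polltimeoutms", "2000")

    val camelCasedKey = "kafkaConsumer.pollTimeoutMs"

    // A plain map lookup with the camel-cased constant misses the entry,
    // so the caller silently falls back to the default timeout.
    println(raw.get(camelCasedKey)) // null

    // A CaseInsensitiveStringMap finds the entry regardless of key casing.
    val options = new CaseInsensitiveStringMap(raw)
    println(options.get(camelCasedKey)) // 2000
  }
}
{code}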
[jira] [Comment Edited] (SPARK-6107) event log files ending with .inprogress should be displayable on the web UI in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871005#comment-16871005 ] leo.zhi edited comment on SPARK-6107 at 6/24/19 9:27 AM: - in version 2.2.0-cdh6.0.1 on a YARN cluster, it still happens. :( was (Author: leo.zhi): in version 2.2.0-cdh6.0.1, it still happens. :( > event log files ending with .inprogress should be displayable on the web UI in > standalone mode > --- > > Key: SPARK-6107 > URL: https://issues.apache.org/jira/browse/SPARK-6107 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.1 >Reporter: Zhang, Liye >Assignee: Zhang, Liye >Priority: Major > Fix For: 1.4.0 > > > when an application finishes abnormally (Ctrl + C, for example), the > history event log file still ends with the *.inprogress* suffix, and the > application state cannot be shown on the web UI; the user just sees "*Application > history not found, Application xxx is still in progress*". > Users should also be able to see the status of abnormally finished applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6107) event log files ending with .inprogress should be displayable on the web UI in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871005#comment-16871005 ] leo.zhi commented on SPARK-6107: in version 2.2.0-cdh6.0.1, it still happens. :( > event log files ending with .inprogress should be displayable on the web UI in > standalone mode > --- > > Key: SPARK-6107 > URL: https://issues.apache.org/jira/browse/SPARK-6107 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.1 >Reporter: Zhang, Liye >Assignee: Zhang, Liye >Priority: Major > Fix For: 1.4.0 > > > when an application finishes abnormally (Ctrl + C, for example), the > history event log file still ends with the *.inprogress* suffix, and the > application state cannot be shown on the web UI; the user just sees "*Application > history not found, Application xxx is still in progress*". > Users should also be able to see the status of abnormally finished applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28144) Remove ZKUtils from Kafka tests
[ https://issues.apache.org/jira/browse/SPARK-28144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870961#comment-16870961 ] Gabor Somogyi commented on SPARK-28144: --- ZookeeperClient has "private [kafka]" visibility; it could be used via reflection, but I think adding another hacky solution, which may not reflect how Kafka itself addresses this issue, is a bad idea in general. > Remove ZKUtils from Kafka tests > --- > > Key: SPARK-28144 > URL: https://issues.apache.org/jira/browse/SPARK-28144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming, Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > ZKUtils has been deprecated since Kafka version 2.0.0, so it would be good to > replace it. > I've taken a look at the possibilities, but it seems there is no working > alternative at the moment. Please see KAFKA-8468. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28144) Remove ZKUtils from Kafka tests
[ https://issues.apache.org/jira/browse/SPARK-28144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870959#comment-16870959 ] Gabor Somogyi commented on SPARK-28144: --- I've made the implementation, but it won't work until the mentioned bug is fixed. > Remove ZKUtils from Kafka tests > --- > > Key: SPARK-28144 > URL: https://issues.apache.org/jira/browse/SPARK-28144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming, Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > ZKUtils has been deprecated since Kafka version 2.0.0, so it would be good to > replace it. > I've taken a look at the possibilities, but it seems there is no working > alternative at the moment. Please see KAFKA-8468. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28144) Remove ZKUtils from Kafka tests
Gabor Somogyi created SPARK-28144: - Summary: Remove ZKUtils from Kafka tests Key: SPARK-28144 URL: https://issues.apache.org/jira/browse/SPARK-28144 Project: Spark Issue Type: Improvement Components: Structured Streaming, Tests Affects Versions: 3.0.0 Reporter: Gabor Somogyi ZKUtils has been deprecated since Kafka version 2.0.0, so it would be good to replace it. I've taken a look at the possibilities, but it seems there is no working alternative at the moment. Please see KAFKA-8468. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870880#comment-16870880 ] Marco Gaido commented on SPARK-28067: - No, it is the same. Are you sure about your configs? {code} macmarco:spark mark9$ git log -5 --oneline 5ad1053f3e (HEAD, apache/master) [SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs skip empty partitions 113f8c8d13 [SPARK-28132][PYTHON] Update document type conversion for Pandas UDFs (pyarrow 0.13.0, pandas 0.24.2, Python 3.7) 9b9d81b821 [SPARK-28131][PYTHON] Update document type conversion between Python data and SQL types in normal UDFs (Python 3.7) 54da3bbfb2 [SPARK-28127][SQL] Micro optimization on TreeNode's mapChildren method 47f54b1ec7 [SPARK-28118][CORE] Add `spark.eventLog.compression.codec` configuration macmarco:spark mark9$ ./bin/spark-shell 19/06/24 09:17:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://.:4040 Spark context available as 'sc' (master = local[*], app id = local-1561360686725). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152) Type in expressions to have them evaluated. Type :help for more information. scala> val df = Seq( | (BigDecimal("1000"), 1), | (BigDecimal("1000"), 1), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2), | (BigDecimal("1000"), 2)).toDF("decNum", "intNum") df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int] scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum")) df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)] scala> df2.explain == Physical Plan == *(2) HashAggregate(keys=[], functions=[sum(decNum#14)]) +- Exchange SinglePartition +- *(1) HashAggregate(keys=[], functions=[partial_sum(decNum#14)]) +- *(1) Project [decNum#14] +- *(1) BroadcastHashJoin [intNum#8], [intNum#15], Inner, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- LocalTableScan [intNum#8] +- LocalTableScan [decNum#14, intNum#15] scala> df2.show(40,false) +---+ |sum(decNum)| +---+ |null | +---+ {code} > Incorrect results in decimal aggregation with whole-stage code gen enabled > -- > > Key: SPARK-28067 > URL: https://issues.apache.org/jira/browse/SPARK-28067 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 > Environment: Ubuntu LTS 16.04 > Oracle Java 1.8.0_201 > spark-2.4.3-bin-without-hadoop > spark-shell >Reporter: Mark Sirek >Priority: Minor > Labels: correctness > > The following test case involving a join followed by a sum aggregation > returns the wrong answer for the sum: > > {code:java} > val df = Seq( > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > 
(BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2)).toDF("decNum", "intNum") > val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, > "intNum").agg(sum("decNum")) > scala> df2.show(40,false) > --- > sum(decNum) > --- > 4000.00 > --- > > {code} > > The result should be 1040