[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-24 Thread Luca Canali (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872075#comment-16872075
 ] 

Luca Canali commented on SPARK-28091:
-

Thank you [~ste...@apache.org] for your comment and clarifications. Indeed, what 
I am trying to do is to collect executor-level metrics for S3A (and also for 
other Hadoop-compatible filesystems of interest). The goal is to bring these 
metrics into the Spark metrics system, so that they can be used, for example, 
in a performance dashboard and displayed together with the rest of the 
instrumentation metrics.

The original work for this started from the need to measure I/O metrics for a 
custom HDFS-compatible filesystem that we use (called ROOT:) and, more recently, 
also for S3A. Our first implementation was simple: a small change in 
[[ExecutorSource]], which already has code to collect metrics for "hdfs" and 
"file"/local filesystems at the executor level. That code is easy to extend, 
but going that way feels like a short-term hack.

My thought with this PR is to provide a flexible method to add instrumentation, 
starting from our current use case of I/O workload monitoring, but 
also open to several other use cases. 
I am also quite interested to see developments in this area for CPU counters 
and possibly also GPU-related instrumentation.
I think the proposal to use executor plugins for this goes in the direction 
originally outlined by [~irashid] and collaborators in SPARK-24918.

Some links to reference and related material: the code of a few test executor 
metrics plugins that I am developing is at 
[https://github.com/cerndb/SparkExecutorPlugins] 
The general idea of how to build a dashboard with Spark metrics is described in 
[https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark]
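
For illustration, here is a minimal sketch of the shape of such a plugin: an executor plugin 
(the SPARK-24918 org.apache.spark.ExecutorPlugin interface) that registers a Dropwizard gauge. 
The class and metric names are made up, and the MetricRegistry is created locally only because 
the hook to obtain a registry from the Spark metrics system is exactly what this proposal would add.

{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.ExecutorPlugin

// Hypothetical example: names and wiring are illustrative only.
class SimpleExecutorMetricsPlugin extends ExecutorPlugin {

  // Stand-in registry. With the hook proposed here, the plugin would instead
  // register its gauges with the executor's Spark metrics registry, so that
  // they reach the configured sinks (Graphite, JMX, ...).
  private val registry = new MetricRegistry()

  override def init(): Unit = {
    registry.register("plugin.uptimeMillis", new Gauge[Long] {
      private val start = System.currentTimeMillis()
      override def getValue: Long = System.currentTimeMillis() - start
    })
  }

  override def shutdown(): Unit = ()
}
{code}

Such a plugin is loaded on each executor with spark.executor.plugins; the missing piece, and the 
subject of this ticket, is publishing its metrics through the Spark metrics system.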

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics system implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics (see 
> also SPARK-26890) that are useful to monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark jobs. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system which is more flexible 
> and of more general use.
> Advantages:
>  * The metrics plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extension of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> FileSystem getAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop FileSystem recommend using 
> getGlobalStorageStatistics instead, which also provides several additional metrics. 
> getGlobalStorageStatistics is not available in Hadoop 2.7 (it was 
> introduced in Hadoop 2.8). Using a metrics plugin would give Spark an 
> easy way to “opt in” to the new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop-compliant filesystem in use in our organization (EOS, accessed via the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug filesystem 
> and other metrics into the Spark monitoring system. Future work on plugin 
> implementations can extend monitoring to measure the usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might 
> not normally be considered general enough for inclusion in the Apache Spark code 
> base, but that can nevertheless be useful for specialized use cases, tests, or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the Executor Plugin work of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228.
> Tests and examples:
> So far this has been manually tested running Spark on YARN and Kubernetes clusters.
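
As a note on the Hadoop API change mentioned in the description above, the newer per-scheme 
statistics can be read along these lines (a sketch against the Hadoop 2.8+ StorageStatistics API; 
the available metric names depend on the filesystem implementation):

{code:scala}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Iterate over the global storage statistics (Hadoop 2.8+), which cover every
// filesystem scheme used so far (hdfs, s3a, ...), instead of calling the
// deprecated FileSystem.getAllStatistics.
FileSystem.getGlobalStorageStatistics.iterator().asScala.foreach { ss =>
  ss.getLongStatistics.asScala.foreach { stat =>
    println(s"${ss.getName}.${stat.getName} = ${stat.getValue}")
  }
}
{code}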

[jira] [Assigned] (SPARK-28149) Disable negative DNS caching

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28149:


Assignee: (was: Apache Spark)

> Disable negative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> By default the JVM caches failed DNS resolutions; the failures are cached 
> for 10 seconds by default.
> The Alpine JDK used in the Kubernetes images has a default of 5 
> seconds.
> This means that in clusters with slow init time (network sidecar pods, slow 
> network start-up) the executor will never run: the first attempt to 
> connect to the driver fails, that failure is cached, and 
> the retries then happen in a tight loop without actually retrying the lookup.
>  
> The proposed implementation is to have entrypoint.sh (which is 
> exclusive to k8s) alter the file that controls DNS caching and disable 
> negative caching if an environment variable such as "DISABLE_DNS_NEGATIVE_CACHING" is defined. 
>  
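
For context, the JVM setting involved is the networkaddress.cache.negative.ttl security property. 
A small sketch of how it can be inspected and overridden programmatically (the entrypoint.sh 
approach proposed above edits the java.security properties file instead, so that the value is in 
place before the first lookup):

{code:scala}
import java.security.Security

// Failed DNS lookups are cached for this many seconds (JVM default: 10).
val ttl = Security.getProperty("networkaddress.cache.negative.ttl")
println(s"negative DNS cache TTL: $ttl")

// Setting it to 0 disables caching of failed lookups. It must be set before
// the JVM performs its first name lookup, which is why changing the
// java.security file from entrypoint.sh is the more robust option.
Security.setProperty("networkaddress.cache.negative.ttl", "0")
{code}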






[jira] [Assigned] (SPARK-28149) Disable negative DNS caching

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28149:


Assignee: Apache Spark

> Disable negative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Assignee: Apache Spark
>Priority: Minor
>
> By default the JVM caches failed DNS resolutions; the failures are cached 
> for 10 seconds by default.
> The Alpine JDK used in the Kubernetes images has a default of 5 
> seconds.
> This means that in clusters with slow init time (network sidecar pods, slow 
> network start-up) the executor will never run: the first attempt to 
> connect to the driver fails, that failure is cached, and 
> the retries then happen in a tight loop without actually retrying the lookup.
>  
> The proposed implementation is to have entrypoint.sh (which is 
> exclusive to k8s) alter the file that controls DNS caching and disable 
> negative caching if an environment variable such as "DISABLE_DNS_NEGATIVE_CACHING" is defined. 
>  






[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28158:


Assignee: (was: Apache Spark)

> Hive UDFs supports UDT type
> ---
>
> Key: SPARK-28158
> URL: https://issues.apache.org/jira/browse/SPARK-28158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28158:


Assignee: Apache Spark

> Hive UDFs supports UDT type
> ---
>
> Key: SPARK-28158
> URL: https://issues.apache.org/jira/browse/SPARK-28158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-28158) Hive UDFs supports UDT type

2019-06-24 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-28158:
-

 Summary: Hive UDFs supports UDT type
 Key: SPARK-28158
 URL: https://issues.apache.org/jira/browse/SPARK-28158
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3, 3.0.0
Reporter: Genmao Yu









[jira] [Resolved] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28157.
---
Resolution: Invalid

My bad. This issue is invalid.

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist for all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance in a regular manner.






[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-06-24 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872050#comment-16872050
 ] 

Yuming Wang commented on SPARK-28036:
-


{code:sql}
[root@spark-3267648 spark-3.0.0-SNAPSHOT-bin-3.2.0]# bin/spark-shell
19/06/24 23:10:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://spark-3267648.lvs02.dev.ebayc3.com:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1561443018277).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select left('ahoj', -2), right('ahoj', -2)").show
+----------------+-----------------+
|left('ahoj', -2)|right('ahoj', -2)|
+----------------+-----------------+
|                |                 |
+----------------+-----------------+


scala> spark.sql("select left('ahoj', 2), right('ahoj', 2)").show
+---------------+----------------+
|left('ahoj', 2)|right('ahoj', 2)|
+---------------+----------------+
|             ah|              oj|
+---------------+----------------+
{code}

> Built-in udf left/right has inconsistent behavior
> -
>
> Key: SPARK-28036
> URL: https://issues.apache.org/jira/browse/SPARK-28036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> PostgreSQL:
> {code:sql}
> postgres=# select left('ahoj', -2), right('ahoj', -2);
>  left | right 
> --+---
>  ah   | oj
> (1 row)
> {code}
> Spark SQL:
> {code:sql}
> spark-sql> select left('ahoj', -2), right('ahoj', -2);
> spark-sql>
> {code}
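
For reference, the PostgreSQL semantics for a negative count are "all but the last/first n 
characters". In plain Scala terms (just an illustration of the expected values, not Spark code):

{code:scala}
val s = "ahoj"
// PostgreSQL: left(s, -2)  == all but the last 2 characters  -> "ah"
// PostgreSQL: right(s, -2) == all but the first 2 characters -> "oj"
assert(s.dropRight(2) == "ah")
assert(s.drop(2) == "oj")
// Spark SQL currently returns empty strings for the negative case,
// which is the inconsistency reported here.
{code}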






[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-06-24 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872045#comment-16872045
 ] 

Shivu Sondur commented on SPARK-28036:
--

[~yumwang]
select left('ahoj', 2), right('ahoj', 2);
Use with the '-' sign, it will work fine.

I tested in the latest Spark.

 

> Built-in udf left/right has inconsistent behavior
> -
>
> Key: SPARK-28036
> URL: https://issues.apache.org/jira/browse/SPARK-28036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> PostgreSQL:
> {code:sql}
> postgres=# select left('ahoj', -2), right('ahoj', -2);
>  left | right 
> --+---
>  ah   | oj
> (1 row)
> {code}
> Spark SQL:
> {code:sql}
> spark-sql> select left('ahoj', -2), right('ahoj', -2);
> spark-sql>
> {code}






[jira] [Assigned] (SPARK-28156) Join plan sometimes does not use cached query

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28156:


Assignee: (was: Apache Spark)

> Join plan sometimes does not use cached query
> -
>
> Key: SPARK-28156
> URL: https://issues.apache.org/jira/browse/SPARK-28156
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Bruce Robbins
>Priority: Major
>
> I came across a case where a cached query is referenced on both sides of a 
> join, but the InMemoryRelation is inserted on only one side. This case occurs 
> only when the cached query uses a (Hive-style) view.
> Consider this example:
> {noformat}
> // create the data
> val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", 
> "c", "d")
> df1.write.mode("overwrite").format("orc").saveAsTable("table1")
> sql("drop view if exists table1_vw")
> sql("create view table1_vw as select * from table1")
> // create the cached query
> val cacheddataDf = sql("""
> select a, b, c, d
> from table1_vw
> """)
> import org.apache.spark.storage.StorageLevel.DISK_ONLY
> cacheddataDf.createOrReplaceTempView("cacheddata")
> cacheddataDf.persist(DISK_ONLY)
> // main query
> val queryDf = sql(s"""
> select leftside.a, leftside.b
> from cacheddata leftside
> join cacheddata rightside
> on leftside.a = rightside.a
> """)
> queryDf.explain(true)
> {noformat}
> Note that the optimized plan does not use an InMemoryRelation for the right 
> side, but instead just uses a Relation:
> {noformat}
> Project [a#45, b#46]
> +- Join Inner, (a#45 = a#37)
>:- Project [a#45, b#46]
>:  +- Filter isnotnull(a#45)
>: +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 
> replicas)
>:   +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] 
> Batched: true, DataFilters: [], Format: ORC, Location: 
> InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>+- Project [a#37]
>   +- Filter isnotnull(a#37)
>  +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> The fragment does not match the cached query because AliasViewChild adds an 
> extra projection under the View on the right side (see #2 below).
> AliasViewChild adds the extra projection because the exprIds in the View's 
> output appear to have been renamed by Analyzer$ResolveReferences (#1 below). 
> I have not yet looked at why.
> {noformat}
> -
> -
> -
>+- SubqueryAlias `rightside`
>   +- SubqueryAlias `cacheddata`
>  +- Project [a#73, b#74, c#75, d#76]
> +- SubqueryAlias `default`.`table1_vw`
> (#1) ->+- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
> (#2) ->   +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS 
> b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
>  +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) 
> AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
> +- Project [a#37, b#38, c#39, d#40]
>+- SubqueryAlias `default`.`table1`
>   +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> In a larger query (where cacheddata may be referred to only indirectly on 
> either side), this phenomenon can create certain oddities, as the fragment is 
> not replaced with InMemoryRelation, and the fragment is present when the plan 
> is optimized as a whole.
> In Spark 2.1.3, Spark uses InMemoryRelation on both sides.






[jira] [Assigned] (SPARK-28156) Join plan sometimes does not use cached query

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28156:


Assignee: Apache Spark

> Join plan sometimes does not use cached query
> -
>
> Key: SPARK-28156
> URL: https://issues.apache.org/jira/browse/SPARK-28156
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> I came across a case where a cached query is referenced on both sides of a 
> join, but the InMemoryRelation is inserted on only one side. This case occurs 
> only when the cached query uses a (Hive-style) view.
> Consider this example:
> {noformat}
> // create the data
> val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", 
> "c", "d")
> df1.write.mode("overwrite").format("orc").saveAsTable("table1")
> sql("drop view if exists table1_vw")
> sql("create view table1_vw as select * from table1")
> // create the cached query
> val cacheddataDf = sql("""
> select a, b, c, d
> from table1_vw
> """)
> import org.apache.spark.storage.StorageLevel.DISK_ONLY
> cacheddataDf.createOrReplaceTempView("cacheddata")
> cacheddataDf.persist(DISK_ONLY)
> // main query
> val queryDf = sql(s"""
> select leftside.a, leftside.b
> from cacheddata leftside
> join cacheddata rightside
> on leftside.a = rightside.a
> """)
> queryDf.explain(true)
> {noformat}
> Note that the optimized plan does not use an InMemoryRelation for the right 
> side, but instead just uses a Relation:
> {noformat}
> Project [a#45, b#46]
> +- Join Inner, (a#45 = a#37)
>:- Project [a#45, b#46]
>:  +- Filter isnotnull(a#45)
>: +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 
> replicas)
>:   +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] 
> Batched: true, DataFilters: [], Format: ORC, Location: 
> InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>+- Project [a#37]
>   +- Filter isnotnull(a#37)
>  +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> The fragment does not match the cached query because AliasViewChild adds an 
> extra projection under the View on the right side (see #2 below).
> AliasViewChild adds the extra projection because the exprIds in the View's 
> output appear to have been renamed by Analyzer$ResolveReferences (#1 below). 
> I have not yet looked at why.
> {noformat}
> -
> -
> -
>+- SubqueryAlias `rightside`
>   +- SubqueryAlias `cacheddata`
>  +- Project [a#73, b#74, c#75, d#76]
> +- SubqueryAlias `default`.`table1_vw`
> (#1) ->+- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
> (#2) ->   +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS 
> b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
>  +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) 
> AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
> +- Project [a#37, b#38, c#39, d#40]
>+- SubqueryAlias `default`.`table1`
>   +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> In a larger query (where cacheddata may be referred to only indirectly on 
> either side), this phenomenon can create certain oddities, as the fragment is 
> not replaced with InMemoryRelation, and the fragment is present when the plan 
> is optimized as a whole.
> In Spark 2.1.3, Spark uses InMemoryRelation on both sides.






[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28157:


Assignee: (was: Apache Spark)

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist for all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance in a regular manner.






[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28157:


Assignee: Apache Spark

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist for all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance in a regular manner.






[jira] [Updated] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28157:
--
Description: 
At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
the file system, and maintains a permanent blacklist for all event log files 
that failed once at reading. Although this reduces a lot of invalid accesses, there 
is no way to see these log files again after the permissions are recovered 
correctly. The only way has been restarting the SHS.

Apache Spark is unable to detect the permission recovery. However, we had better 
give those blacklisted files a second chance in a regular manner.

  was:
Since Spark 2.4.0, SPARK-24948 delegated access permission checks to the file 
system, and maintains a permanent blacklist for all event log files that failed once 
at reading. Although this reduces a lot of invalid accesses, there is no way to 
see these log files again after the permissions are recovered correctly. The only 
way has been restarting the SHS.

Apache Spark is unable to detect the permission recovery. However, we had better 
give those blacklisted files a second chance in a regular manner.


> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist for all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance in a regular manner.






[jira] [Updated] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28157:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since Spark 2.4.0, SPARK-24948 delegated access permission checks to the file 
> system, and maintains a permanent blacklist for all event log files that failed 
> once at reading. Although this reduces a lot of invalid accesses, there is no 
> way to see these log files again after the permissions are recovered correctly. 
> The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance in a regular manner.






[jira] [Created] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-24 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28157:
-

 Summary: Make SHS check Spark event log file permission changes
 Key: SPARK-28157
 URL: https://issues.apache.org/jira/browse/SPARK-28157
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Since Spark 2.4.0, SPARK-24948 delegated access permission checks to the file 
system, and maintains a permanent blacklist for all event log files that failed once 
at reading. Although this reduces a lot of invalid accesses, there is no way to 
see these log files again after the permissions are recovered correctly. The only 
way has been restarting the SHS.

Apache Spark is unable to detect the permission recovery. However, we had better 
give those blacklisted files a second chance in a regular manner.






[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-24 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872015#comment-16872015
 ] 

Tony Zhang commented on SPARK-28135:


HIVE-21916

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}
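
The Spark values above look like the result of casting the double result to a long: anything 
beyond the long range saturates at Long.MaxValue. A quick illustration in plain Scala (the 
PostgreSQL functions stay in double precision, which is why they return 1.2345678901234e+200):

{code:scala}
// Casting a double larger than Long.MaxValue to a long saturates:
val x = 1.2345678901234e200
println(x.toLong)       // 9223372036854775807
println(Long.MaxValue)  // 9223372036854775807
{code}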






[jira] [Assigned] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28153:


Assignee: Apache Spark

> input_file_name doesn't work with Python UDF in the same project
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> ++-+
> |(id)|input_file_name()|
> ++-+
> |   8| |
> |   5| |
> |   0| |
> |   9| |
> |   6| |
> |   2| |
> |   3| |
> |   4| |
> |   7| |
> |   1| |
> ++-+
> {code}






[jira] [Assigned] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28153:


Assignee: (was: Apache Spark)

> input_file_name doesn't work with Python UDF in the same project
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> ++-+
> |(id)|input_file_name()|
> ++-+
> |   8| |
> |   5| |
> |   0| |
> |   9| |
> |   6| |
> |   2| |
> |   3| |
> |   4| |
> |   7| |
> |   1| |
> ++-+
> {code}






[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-06-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871976#comment-16871976
 ] 

Apache Spark commented on SPARK-27100:
--

User 'parthchandra' has created a pull request for this issue:
https://github.com/apache/spark/pull/24957

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: SPARK-27100-Overflow.txt, stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
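
Not the fix in the pull request above, but a commonly used mitigation for this kind of 
deep-lineage StackOverflowError during job serialization is to checkpoint periodically so the 
lineage is truncated. ALS already exposes a checkpoint interval; a sketch (directory and 
parameter values are illustrative):

{code:scala}
import org.apache.spark.ml.recommendation.ALS

// Checkpointing cuts the RDD lineage, keeping the serialized task closures shallow.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // path is illustrative

val als = new ALS()
  .setRank(100)
  .setMaxIter(100)
  .setImplicitPrefs(true)
  .setCheckpointInterval(10)  // checkpoint every 10 iterations
{code}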






[jira] [Resolved] (SPARK-28108) Simplify OrcFilters

2019-06-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28108.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24910
[https://github.com/apache/spark/pull/24910]

> Simplify OrcFilters
> ---
>
> Key: SPARK-28108
> URL: https://issues.apache.org/jira/browse/SPARK-28108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> In #24068, @IvanVergiliev reports that OrcFilters.createBuilder has 
> exponential complexity in the height of the filter tree due to the way the 
> check-and-build pattern is implemented.
> This is because the same method createBuilder is called twice recursively for 
> any children under And/Or/Not nodes, so inside the first call the 
> second call is made as well (see the description in #24068 for details).
> Compared to the approach in #24068, I propose a very simple solution for the 
> issue. We can rely on the result of convertibleFilters, which can build a 
> fully convertible tree. With it, we don't need to worry about whether the children 
> of a certain node are convertible in method createBuilder.
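
A toy model of the complexity issue (not the actual OrcFilters code): if the "check" step 
re-invokes the build step on both children before building them for real, each And node triggers 
two full traversals of its subtree, which is exponential in the tree height. Checking 
convertibility once up front, as convertibleFilters does, keeps it linear.

{code:scala}
// Toy filter tree, just to count calls; not Spark's Filter classes.
sealed trait F
case class Leaf(ok: Boolean) extends F
case class And(l: F, r: F) extends F

var calls = 0

// Check-and-build: probe both children first, then build them again.
def build(f: F): Option[String] = { calls += 1; f match {
  case Leaf(ok)  => if (ok) Some("leaf") else None
  case And(l, r) =>
    if (build(l).isEmpty || build(r).isEmpty) None        // check
    else Some(s"and(${build(l).get},${build(r).get})")    // build again
}}

// A chain of 20 And nodes already needs on the order of 2^20 calls,
// versus about 20 if convertibility were established once up front.
def chain(n: Int): F = if (n == 0) Leaf(true) else And(chain(n - 1), Leaf(true))
build(chain(20))
println(calls)
{code}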






[jira] [Assigned] (SPARK-28108) Simplify OrcFilters

2019-06-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28108:
---

Assignee: Gengliang Wang

> Simplify OrcFilters
> ---
>
> Key: SPARK-28108
> URL: https://issues.apache.org/jira/browse/SPARK-28108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> In #24068, @IvanVergiliev reports that OrcFilters.createBuilder has 
> exponential complexity in the height of the filter tree due to the way the 
> check-and-build pattern is implemented.
> This is because the same method createBuilder is called twice recursively for 
> any children under And/Or/Not nodes, so inside the first call the 
> second call is made as well (see the description in #24068 for details).
> Compared to the approach in #24068, I propose a very simple solution for the 
> issue. We can rely on the result of convertibleFilters, which can build a 
> fully convertible tree. With it, we don't need to worry about whether the children 
> of a certain node are convertible in method createBuilder.






[jira] [Commented] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile

2019-06-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871948#comment-16871948
 ] 

Dongjoon Hyun commented on SPARK-28114:
---

Although I'm a read-only Jenkins member, I added the three new jobs (except the 
`compilation job`) to `Spark QA Test (Dashboard)`, [~shaneknapp].

> Add Jenkins job for `Hadoop-3.2` profile
> 
>
> Key: SPARK-28114
> URL: https://issues.apache.org/jira/browse/SPARK-28114
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
>
> Spark 3.0 is a major version change. We want to have the following new jobs:
> 1. SBT with hadoop-3.2
> 2. Maven with hadoop-3.2 (on JDK8 and JDK11)
> Also, shall we have a limit on concurrent runs for the following existing 
> job? Currently, it invokes multiple jobs concurrently. We can save 
> resources by limiting it to 1 like the other jobs.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing
> We will drop the four `branch-2.3` jobs at the end of August 2019.






[jira] [Commented] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile

2019-06-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871944#comment-16871944
 ] 

Dongjoon Hyun commented on SPARK-28114:
---

Thank you so much, [~shaneknapp] and [~srowen].
The four new jobs work as expected.

The job weirdly showed `--force` at that failure, like the following, but 
removing the `hadoop-1` profile seems to have made it work again.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6509/console
{code}
+ [[ hadoop-2.7 == hadoop-1 ]]
+ build/mvn --force -DzincPort=3245 -DskipTests -Phadoop-2.7 -Pyarn -Phive 
-Phive-thriftserver -Pkinesis-asl -Pmesos clean package
{code}

> Add Jenkins job for `Hadoop-3.2` profile
> 
>
> Key: SPARK-28114
> URL: https://issues.apache.org/jira/browse/SPARK-28114
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
>
> Spark 3.0 is a major version change. We want to have the following new jobs:
> 1. SBT with hadoop-3.2
> 2. Maven with hadoop-3.2 (on JDK8 and JDK11)
> Also, shall we have a limit on concurrent runs for the following existing 
> job? Currently, it invokes multiple jobs concurrently. We can save 
> resources by limiting it to 1 like the other jobs.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing
> We will drop the four `branch-2.3` jobs at the end of August 2019.






[jira] [Assigned] (SPARK-27815) do not leak SaveMode to file source v2

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27815:


Assignee: Apache Spark

> do not leak SaveMode to file source v2
> --
>
> Key: SPARK-27815
> URL: https://issues.apache.org/jira/browse/SPARK-27815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Blocker
>
> Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to 
> file source v2. This should be removed and file source v2 should not accept 
> SaveMode.






[jira] [Assigned] (SPARK-27815) do not leak SaveMode to file source v2

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27815:


Assignee: (was: Apache Spark)

> do not leak SaveMode to file source v2
> --
>
> Key: SPARK-27815
> URL: https://issues.apache.org/jira/browse/SPARK-27815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to 
> file source v2. This should be removed and file source v2 should not accept 
> SaveMode.






[jira] [Created] (SPARK-28156) Join plan sometimes does not use cached query

2019-06-24 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-28156:
-

 Summary: Join plan sometimes does not use cached query
 Key: SPARK-28156
 URL: https://issues.apache.org/jira/browse/SPARK-28156
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3, 2.3.3, 3.0.0
Reporter: Bruce Robbins


I came across a case where a cached query is referenced on both sides of a 
join, but the InMemoryRelation is inserted on only one side. This case occurs 
only when the cached query uses a (Hive-style) view.

Consider this example:
{noformat}
// create the data
val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", 
"c", "d")
df1.write.mode("overwrite").format("orc").saveAsTable("table1")
sql("drop view if exists table1_vw")
sql("create view table1_vw as select * from table1")

// create the cached query
val cacheddataDf = sql("""
select a, b, c, d
from table1_vw
""")

import org.apache.spark.storage.StorageLevel.DISK_ONLY
cacheddataDf.createOrReplaceTempView("cacheddata")
cacheddataDf.persist(DISK_ONLY)

// main query
val queryDf = sql(s"""
select leftside.a, leftside.b
from cacheddata leftside
join cacheddata rightside
on leftside.a = rightside.a
""")

queryDf.explain(true)
{noformat}
Note that the optimized plan does not use an InMemoryRelation for the right 
side, but instead just uses a Relation:
{noformat}
Project [a#45, b#46]
+- Join Inner, (a#45 = a#37)
   :- Project [a#45, b#46]
   :  +- Filter isnotnull(a#45)
   : +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 
replicas)
   :   +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] 
Batched: true, DataFilters: [], Format: ORC, Location: 
InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1],
 PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct
   +- Project [a#37]
  +- Filter isnotnull(a#37)
 +- Relation[a#37,b#38,c#39,d#40] orc

{noformat}
The fragment does not match the cached query because AliasViewChild adds an 
extra projection under the View on the right side (see #2 below).

AliasViewChild adds the extra projection because the exprIds in the View's 
output appear to have been renamed by Analyzer$ResolveReferences (#1 below). I 
have not yet looked at why.
{noformat}
-
-
-
   +- SubqueryAlias `rightside`
  +- SubqueryAlias `cacheddata`
 +- Project [a#73, b#74, c#75, d#76]
+- SubqueryAlias `default`.`table1_vw`
(#1) ->+- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
(#2) ->   +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS 
b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
 +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) 
AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
+- Project [a#37, b#38, c#39, d#40]
   +- SubqueryAlias `default`.`table1`
  +- Relation[a#37,b#38,c#39,d#40] orc

{noformat}
In a larger query (where cacheddata may be referred to only indirectly on either 
side), this phenomenon can create certain oddities, as the fragment is 
not replaced with InMemoryRelation, and the fragment is present when the plan 
is optimized as a whole.

In Spark 2.1.3, Spark uses InMemoryRelation on both sides.
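
One way to see the symptom programmatically with the example above: count the InMemoryRelation 
nodes in the optimized plan. Going by the plan shown above, this prints 1 on affected versions, 
while 2 (one per join side) would be expected:

{code:scala}
import org.apache.spark.sql.execution.columnar.InMemoryRelation

val cachedUses = queryDf.queryExecution.optimizedPlan.collect {
  case r: InMemoryRelation => r
}
println(cachedUses.size)
{code}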






[jira] [Created] (SPARK-28155) Improve SQL optimizer's predicate pushdown performance for cascading joins

2019-06-24 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-28155:
--

 Summary: Improve SQL optimizer's predicate pushdown performance 
for cascading joins
 Key: SPARK-28155
 URL: https://issues.apache.org/jira/browse/SPARK-28155
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The current Catalyst optimizer's predicate pushdown is divided into two 
separate rules: PushDownPredicate and PushThroughJoin. This is not efficient 
for optimizing cascading joins such as TPC-DS q64, where a whole default batch 
is re-executed just because of this. We need a more efficient approach that pushes 
down predicates as far as possible in a single pass.
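
To make the cascading-join case concrete (tables and predicate are made up), this is the kind of 
plan where a predicate has to travel through several joins before it reaches the scan:

{code:scala}
// Hypothetical tables t1, t2, t3 registered as temp views, each with columns id and v.
val q = spark.sql("""
  SELECT t1.id, t3.v
  FROM t1
  JOIN t2 ON t1.id = t2.id
  JOIN t3 ON t2.id = t3.id
  WHERE t1.v > 10
""")

// The filter on t1.v should end up directly above the t1 scan; with the rule
// split described above, pushing it below both joins can take several
// iterations of the default optimizer batch.
q.queryExecution.optimizedPlan
{code}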






[jira] [Assigned] (SPARK-28154) GMM fix double caching

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28154:


Assignee: (was: Apache Spark)

> GMM fix double caching
> --
>
> Key: SPARK-28154
> URL: https://issues.apache.org/jira/browse/SPARK-28154
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The intermediate rdd is always cached.  We should only cache it if necessary.






[jira] [Assigned] (SPARK-28154) GMM fix double caching

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28154:


Assignee: Apache Spark

> GMM fix double caching
> --
>
> Key: SPARK-28154
> URL: https://issues.apache.org/jira/browse/SPARK-28154
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> The intermediate rdd is always cached.  We should only cache it if necessary.






[jira] [Commented] (SPARK-28154) GMM fix double caching

2019-06-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871909#comment-16871909
 ] 

Apache Spark commented on SPARK-28154:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/24919

> GMM fix double caching
> --
>
> Key: SPARK-28154
> URL: https://issues.apache.org/jira/browse/SPARK-28154
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The intermediate rdd is always cached.  We should only cache it if necessary.






[jira] [Created] (SPARK-28154) GMM fix double caching

2019-06-24 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28154:


 Summary: GMM fix double caching
 Key: SPARK-28154
 URL: https://issues.apache.org/jira/browse/SPARK-28154
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0, 2.3.0, 3.0.0
Reporter: zhengruifeng


The intermediate rdd is always cached.  We should only cache it if necessary.
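
The usual idiom for avoiding the double caching (a sketch of the general pattern along the lines 
of what several spark.ml algorithms do, not the GMM patch itself): persist the intermediate data 
only when the input is not already cached, and unpersist only what was persisted here.

{code:scala}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

def withPersisted[T, R](data: Dataset[T])(body: Dataset[T] => R): R = {
  // Only take over persistence if the caller has not cached the data already.
  val handlePersistence = data.storageLevel == StorageLevel.NONE
  if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    body(data)
  } finally {
    if (handlePersistence) data.unpersist()
  }
}
{code}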






[jira] [Comment Edited] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871885#comment-16871885
 ] 

Hyukjin Kwon edited comment on SPARK-27966 at 6/25/19 1:01 AM:
---

While debugging this, I found one case not working:

{code}
from pyspark.sql.functions import udf, input_file_name
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
input_file_name()).show()
{code}

{code}
++-+
|(id)|input_file_name()|
++-+
|   8| |
|   5| |
|   0| |
|   9| |
|   6| |
|   2| |
|   3| |
|   4| |
|   7| |
|   1| |
++-+
{code}

But the reproducer described here works. Is this the same issue or a different 
issue, [~Chr_96er]?


was (Author: hyukjin.kwon):
While debugging this, I found one case not working:

{code}
from pyspark.sql.functions import udf, input_file_name
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
input_file_name()).show()
{code}

{code}
++-+
|(id)|input_file_name()|
++-+
|   8| |
|   5| |
|   0| |
|   9| |
|   6| |
|   2| |
|   3| |
|   4| |
|   7| |
|   1| |
++-+
{code}

But the reproducer described here works. Is this the same issue or a different 
issue, [~Chr_96er]?

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128: 
> _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}
> My environment is databricks and debugging the Log4j output showed me that 
> the issue occurred when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've setup a larger cluster for which after parallel file listing the 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Created] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project

2019-06-24 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28153:


 Summary: input_file_name doesn't work with Python UDF in the same 
project
 Key: SPARK-28153
 URL: https://issues.apache.org/jira/browse/SPARK-28153
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


{code}
from pyspark.sql.functions import udf, input_file_name
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
input_file_name()).show()
{code}

{code}
+----+-----------------+
|(id)|input_file_name()|
+----+-----------------+
|   8|                 |
|   5|                 |
|   0|                 |
|   9|                 |
|   6|                 |
|   2|                 |
|   3|                 |
|   4|                 |
|   7|                 |
|   1|                 |
+----+-----------------+
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871885#comment-16871885
 ] 

Hyukjin Kwon commented on SPARK-27966:
--

While debugging this, I found one case not working:

{code}
from pyspark.sql.functions import udf, input_file_name
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
input_file_name()).show()
{code}

{code}
+----+-----------------+
|(id)|input_file_name()|
+----+-----------------+
|   8|                 |
|   5|                 |
|   0|                 |
|   9|                 |
|   6|                 |
|   2|                 |
|   3|                 |
|   4|                 |
|   7|                 |
|   1|                 |
+----+-----------------+
{code}

But the reproducer described here works. Is this the same issue or a different 
one, [~Chr_96er]?

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar and probably related to SPARK-26128. The 
> _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is databricks and debugging the Log4j output showed me that 
> the issue occurred when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've setup a larger cluster for which after parallel file listing the 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26896) Add maven profiles for running tests with JDK 11

2019-06-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26896.
---
Resolution: Not A Problem

> Add maven profiles for running tests with JDK 11
> 
>
> Key: SPARK-26896
> URL: https://issues.apache.org/jira/browse/SPARK-26896
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> Running unit tests w/ JDK 11 trips over some issues w/ the new module system. 
>  These can be worked around with the new {{--add-opens}} etc. commands.  I 
> think we need to add a build profile for JDK 11 to add some extra args to the 
> test runners.
> In particular:
> 1) removal of jaxb from java itself (used in pmml export in mllib)
> 2) Some reflective access which results in failures, eg. 
> {noformat}
> Unable to make field jdk.internal.ref.PhantomCleanable
> jdk.internal.ref.PhantomCleanable.prev accessible: module java.base does
> not "opens jdk.internal.ref" to unnamed module
> {noformat}
> 3) Some reflective access which results in warnings (you can add 
> {{--illegal-access=warn}} to see all of these).
> All I'm proposing we do here is put in the required handling to make these 
> problems go away, not necessarily do the "right" thing by no longer 
> referencing these unexposed internals.
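For reference, a hedged illustration of the kind of extra JVM arguments such a 
profile could pass to the test runners. The {{--add-opens}} target below 
matches the reflective-access failure quoted above; the full set of flags a 
JDK 11 profile would actually need is not spelled out in this ticket:

{noformat}
--illegal-access=warn --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED
{noformat}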



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-24 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871839#comment-16871839
 ] 

Steve Loughran commented on SPARK-28091:


The real source of metrics for S3A (and the Azure stores, Google cloud) is 
actually the per-instance StorageStatistics you can get from 
{{getStorageStatistics}} on an instance; look at 
{{org.apache.hadoop.fs.s3a.Statistic}} for what is collected there... pretty 
much every operation has its own counter.

It'd be interesting to see what your collection code does here. It would also 
be interesting to think about whether people could extend it to get low-level 
stats from the CPUs themselves.

One issue I have found with trying to collect per-query stats is that the 
filesystem-instance counters can't be tied down to specific queries, as they 
aren't per-thread. No good solutions there, at least nothing under dev, though 
the Impala team have been asking for stuff (primarily collecting input stream 
stats on seek cost).
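A minimal sketch of reading those per-instance counters, assuming Hadoop 2.8+ 
(where {{FileSystem#getStorageStatistics}} is available) and the s3a connector 
on the classpath; the bucket URI is a placeholder:

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Grab the FileSystem instance for the bucket and dump its per-instance counters.
val fs = FileSystem.get(new Path("s3a://some-bucket/").toUri, new Configuration())
val stats = fs.getStorageStatistics
stats.getLongStatistics.asScala.foreach { s =>
  println(s"${s.getName} = ${s.getValue}")
}
{code}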

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to  monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we proposes to add a metrics plugin system which is of more flexible 
> and general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced 
> in Hadoop 2.8). Using a metrics plugin for Spark would provide an easy way to 
> “opt in” to such new API calls for those deploying suitable Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug Filesystem 
> and other metrics into the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not 
> normally be considered general enough for inclusion in Apache Spark code, but 
> that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.
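To illustrate the shape of what the proposal enables, a hypothetical sketch in 
which an executor plugin (in the spirit of SPARK-24918's ExecutorPlugin) is 
handed a Dropwizard MetricRegistry. The {{registerMetrics}} hand-off and the 
metric name are inventions of this sketch, not existing Spark APIs; the 
statistics call shown is the deprecated Hadoop {{getAllStatistics}} mentioned 
in the description:

{code}
import scala.collection.JavaConverters._
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem

// Hypothetical sketch: in practice this class would also implement the
// SPARK-24918 ExecutorPlugin interface; the `registry` hook below is exactly
// what this ticket proposes and does NOT exist in Spark today.
class S3AMetricsPlugin {
  def registerMetrics(registry: MetricRegistry): Unit = {
    registry.register("s3a.bytesRead", new Gauge[Long] {
      override def getValue: Long =
        FileSystem.getAllStatistics.asScala   // deprecated API, see description
          .filter(_.getScheme == "s3a")
          .map(_.getBytesRead)
          .sum
    })
  }
}
{code}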



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28150:


Assignee: Apache Spark

> Failure to create multiple contexts in same JVM with Kerberos auth
> --
>
> Key: SPARK-28150
> URL: https://issues.apache.org/jira/browse/SPARK-28150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Take the following small app that creates multiple contexts (not 
> concurrently):
> {code}
> from pyspark.context import SparkContext
> import time
> for i in range(2):
>   with SparkContext() as sc:
> pass
>   time.sleep(5)
> {code}
> This fails when kerberos (without dt renewal) is being used:
> {noformat}
> 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
> Caused by: 
> org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error 
> calling method hbase.pb.AuthenticationService.GetAuthenticationToken
> at 
> org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException):
>  org.apache.hadoop.hbase.security.AccessDeniedException: Token generation 
> only allowed for Kerberos authenticated clients
> at 
> org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126)
> {noformat}
> If you enable dt renewal things work since the codes takes a slightly 
> different path when generating the initial delegation tokens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28150:


Assignee: (was: Apache Spark)

> Failure to create multiple contexts in same JVM with Kerberos auth
> --
>
> Key: SPARK-28150
> URL: https://issues.apache.org/jira/browse/SPARK-28150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Take the following small app that creates multiple contexts (not 
> concurrently):
> {code}
> from pyspark.context import SparkContext
> import time
> for i in range(2):
>   with SparkContext() as sc:
> pass
>   time.sleep(5)
> {code}
> This fails when kerberos (without dt renewal) is being used:
> {noformat}
> 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
> Caused by: 
> org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error 
> calling method hbase.pb.AuthenticationService.GetAuthenticationToken
> at 
> org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException):
>  org.apache.hadoop.hbase.security.AccessDeniedException: Token generation 
> only allowed for Kerberos authenticated clients
> at 
> org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126)
> {noformat}
> If you enable dt renewal things work since the codes takes a slightly 
> different path when generating the initial delegation tokens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector

2019-06-24 Thread Shiv Prashant Sood (JIRA)
Shiv Prashant Sood created SPARK-28152:
--

 Summary: ShortType and FloatTypes are not correctly mapped to 
right JDBC types when using JDBC connector
 Key: SPARK-28152
 URL: https://issues.apache.org/jira/browse/SPARK-28152
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3, 3.0.0
Reporter: Shiv Prashant Sood


ShortType and FloatType are not correctly mapped to the right JDBC types when 
using the JDBC connector. This results in tables or Spark data frames being 
created with unintended types.

Some example issues:
 * A write from a df with a ShortType column results in a SQL table with column 
type INTEGER as opposed to SMALLINT, and thus a larger table than expected.
 * A read results in a dataframe with type INTEGER as opposed to ShortType.

FloatType has an issue on the read path. On the write path the Spark data type 
'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
the read path, when JDBC data types are converted to Catalyst data types 
(getCatalystType), 'Real' is incorrectly mapped to 'DoubleType' rather than 
'FloatType'.
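A hedged sketch of the dialect-level mappings involved; the dialect name and 
URL prefix are placeholders for illustration, not the eventual fix:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Illustrative dialect: pins ShortType <-> SMALLINT on both paths and maps
// REAL back to FloatType on the read path.
object ExampleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:example")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case FloatType => Some(JdbcType("REAL", Types.REAL))
    case _ => None
  }

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.SMALLINT => Some(ShortType)
      case Types.REAL => Some(FloatType)
      case _ => None
    }
}

// Registered before reading or writing through the JDBC data source.
JdbcDialects.registerDialect(ExampleDialect)
{code}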

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation

2019-06-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24432:
--

Assignee: (was: Marcelo Vanzin)

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation

2019-06-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24432:
--

Assignee: Marcelo Vanzin

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Yinan Li
>Assignee: Marcelo Vanzin
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16474) Global Aggregation doesn't seem to work at all

2019-06-24 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871798#comment-16871798
 ] 

Josh Rosen commented on SPARK-16474:


I just ran into this same issue.

The problem here is that {{RelationalGroupedDataset}} only supports custom 
aggregate expressions which are defined over {{Row}}s; you can see this from 
the code at 
[https://github.com/apache/spark/blob/67042e90e763f6a2716b942c3f887e94813889ce/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L225].

The solution here is to use the typed {{KeyValueGroupedDataset}} API via 
{{groupByKey()}}.

Unfortunately there's not a great way to provide better error messages here: 
{{Aggregator}} doesn't require {{TypeTag}} or {{ClassTag}} for its input type, 
so we lack a reliable mechanism to detect and fail-fast when we're passed an 
aggregate over non-Rows.

 

> Global Aggregation doesn't seem to work at all 
> ---
>
> Key: SPARK-16474
> URL: https://issues.apache.org/jira/browse/SPARK-16474
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Amit Sela
>Priority: Major
>
> Executing a global aggregation (not grouped by key) fails.
> Take the following code for example:
> {code}
> val session = SparkSession.builder()
>   .appName("TestGlobalAggregator")
>   .master("local[*]")
>   .getOrCreate()
> import session.implicits._
> val ds1 = List(1, 2, 3).toDS
> val ds2 = ds1.agg(
>   new Aggregator[Int, Int, Int]{
>   def zero: Int = 0
>   def reduce(b: Int, a: Int): Int = b + a
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(reduction: Int): Int = reduction
>   def bufferEncoder: Encoder[Int] = implicitly[Encoder[Int]]
>   def outputEncoder: Encoder[Int] = implicitly[Encoder[Int]]
> }.toColumn)
> ds2.printSchema
> ds2.show
> {code}
> I would expect the result to be 6, but instead I get the following exception:
> {noformat}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to java.lang.Integer 
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
> .
> {noformat}
> Trying the same code on DataFrames in 1.6.2 results in:
> {noformat}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved 
> operator 'Aggregate [(anon$1(),mode=Complete,isDistinct=false) AS 
> anon$1()#8]; 
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> ..
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables

2019-06-24 Thread Shiv Prashant Sood (JIRA)
Shiv Prashant Sood created SPARK-28151:
--

 Summary: ByteType is not correctly mapped for read/write of 
SQLServer tables
 Key: SPARK-28151
 URL: https://issues.apache.org/jira/browse/SPARK-28151
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3, 3.0.0
Reporter: Shiv Prashant Sood


Writing a dataframe with a column of type ByteType fails when using the JDBC 
connector for SQL Server. Append and read of tables also fail. The problems are 
due to:

1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in 
JdbcUtils.scala, where ByteType gets mapped to the text "BYTE"; it should be 
mapped to TINYINT. The current mapping is:
{{case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))}}

In getCatalystType() (the JDBC-to-Catalyst type mapping), TINYINT is mapped to 
INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from 
the point of view of upcasting, but it leads to a 4-byte allocation rather than 
1 byte for ByteType.

2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: 
Metadata), which sets the value in the RDD row according to the data type. 
There is no mapping for ByteType here, so reads will fail with an error once 
getCatalystType() is fixed.

Note : These issues were found when reading/writing with SQLServer. Will be 
submitting a PR soon to fix these mappings in MSSQLServerDialect.

Error seen when writing table

(JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
parameter, or variable #2: *Cannot find data type BYTE*.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable 
#2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
 ..
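A hedged sketch of the corrected mappings the reporter describes; the method 
signatures are simplified for illustration and the actual fix would go through 
MsSqlServerDialect with the full dialect signatures:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcType
import org.apache.spark.sql.types.{ByteType, DataType}

// Write path: emit TINYINT DDL instead of the non-existent BYTE type.
def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
  case ByteType => Some(JdbcType("TINYINT", Types.TINYINT))
  case _ => None
}

// Read path: map TINYINT back to ByteType rather than IntegerType.
def getCatalystType(sqlType: Int): Option[DataType] = sqlType match {
  case Types.TINYINT => Some(ByteType)
  case _ => None
}
{code}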

 

 

 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms

2019-06-24 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871677#comment-16871677
 ] 

Ruben Berenguel commented on SPARK-25994:
-

Hi [~mju] sounds good to me (sorry for the delay, conference season). I assume 
this next work is in SPARK-27303 and the corresponding PR? I'm not sure how 
much time I can give it in the coming couple of weeks, but I'll try to stay on 
top of the API design as much as I can

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
>  Issue Type: Epic
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>  Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the 
> current graph component. This reflects the importance of the graphs data 
> model, which naturally pairs with an important class of analytic function, 
> the network or graph algorithm. 
> However, GraphX is not actively maintained. It is based on RDDs, and cannot 
> exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala 
> users.
> GraphFrames is a Spark package, which implements DataFrame-based graph 
> algorithms, and also incorporates simple graph pattern matching with fixed 
> length patterns (called “motifs”). GraphFrames is based on DataFrames, but 
> has a semantically weak graph data model (based on untyped edges and 
> vertices). The motif pattern matching facility is very limited by comparison 
> with the well-established Cypher language. 
> The Property Graph data model has become quite widespread in recent years, 
> and is the primary focus of commercial graph data management and of graph 
> data research, both for on-premises and cloud data management. Many users of 
> transactional graph databases also wish to work with immutable graphs in 
> Spark.
> The idea is to define a Cypher-compatible Property Graph type based on 
> DataFrames; to replace GraphFrames querying with Cypher; to reimplement 
> GraphX/GraphFrames algos on the PropertyGraph type. 
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), 
> reusing existing proven designs and code, will be employed in Spark 3.0. This 
> graph query processor, like CAPS, will overlay and drive the SparkSQL 
> Catalyst query engine, using the CAPS graph query planner.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28140:


Assignee: (was: Apache Spark)

> Pyspark API to create spark.mllib RowMatrix from DataFrame
> --
>
> Key: SPARK-28140
> URL: https://issues.apache.org/jira/browse/SPARK-28140
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Henry Davidge
>Priority: Major
>
> Since many functions are only implemented in spark.mllib, it is often 
> necessary to convert DataFrames of spark.ml vectors to spark.mllib 
> distributed matrix formats. The first step, converting the spark.ml vectors 
> to the spark.mllib equivalent, is straightforward. However, to the best of my 
> knowledge it's not possible to convert the resulting DataFrame to a RowMatrix 
> without using a python lambda function, which can have a significant 
> performance hit. In my recent use case, SVD took 3.5m using the Scala API, 
> but 12m using Python.
> To get around this performance hit, I propose adding a constructor to the 
> Pyspark RowMatrix class that accepts a DataFrame with a single column of 
> spark.mllib vectors. I'd be happy to add an equivalent API for 
> IndexedRowMatrix if there is demand.
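For comparison, a minimal sketch of the Scala fast path the reporter measured 
against, assuming a DataFrame with a single spark.ml vector column named 
"features" (the toy data below stands in for a real dataset):

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("RowMatrixFromDF").getOrCreate()
import spark.implicits._

// Toy DataFrame with a single spark.ml vector column.
val df = Seq(Tuple1(Vectors.dense(1.0, 2.0)), Tuple1(Vectors.dense(3.0, 4.0)))
  .toDF("features")

// Convert spark.ml vectors to spark.mllib vectors, then build the RowMatrix
// without any Python lambda in the loop.
val mllibDf = MLUtils.convertVectorColumnsFromML(df, "features")
val mat = new RowMatrix(mllibDf.select("features").rdd.map(_.getAs[OldVector](0)))
val svd = mat.computeSVD(2, computeU = false)
println(svd.s)
{code}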



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28140) Pyspark API to create spark.mllib RowMatrix from DataFrame

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28140:


Assignee: Apache Spark

> Pyspark API to create spark.mllib RowMatrix from DataFrame
> --
>
> Key: SPARK-28140
> URL: https://issues.apache.org/jira/browse/SPARK-28140
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Henry Davidge
>Assignee: Apache Spark
>Priority: Major
>
> Since many functions are only implemented in spark.mllib, it is often 
> necessary to convert DataFrames of spark.ml vectors to spark.mllib 
> distributed matrix formats. The first step, converting the spark.ml vectors 
> to the spark.mllib equivalent, is straightforward. However, to the best of my 
> knowledge it's not possible to convert the resulting DataFrame to a RowMatrix 
> without using a python lambda function, which can have a significant 
> performance hit. In my recent use case, SVD took 3.5m using the Scala API, 
> but 12m using Python.
> To get around this performance hit, I propose adding a constructor to the 
> Pyspark RowMatrix class that accepts a DataFrame with a single column of 
> spark.mllib vectors. I'd be happy to add an equivalent API for 
> IndexedRowMatrix if there is demand.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth

2019-06-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-28150:
---
Description: 
Take the following small app that creates multiple contexts (not concurrently):

{code}
from pyspark.context import SparkContext
import time

for i in range(2):
  with SparkContext() as sc:
pass
  time.sleep(5)
{code}

This fails when kerberos (without dt renewal) is being used:

{noformat}
19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
Caused by: org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: 
Error calling method hbase.pb.AuthenticationService.GetAuthenticationToken
at 
org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
Caused by: 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException):
 org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only 
allowed for Kerberos authenticated clients
at 
org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126)
{noformat}

If you enable dt renewal things work since the codes takes a slightly different 
path when generating the initial delegation tokens.

  was:
Take the following small app that creates multiple contexts (not concurrently):

{code}
from pyspark.context import SparkContext
import time

for i in range(2):
  with SparkContext() as sc:
pass
  time.sleep(5)
{code}

This fails when kerberos (without dt renewal) is being used:

{noformat}
19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
Caused by: org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: 
Error calling method hbase.pb.AuthenticationService.GetAuthenticationToken
at 
org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
{noformat}

If you enable dt renewal things work since the codes takes a slightly different 
path when generating the initial delegation tokens.


> Failure to create multiple contexts in same JVM with Kerberos auth
> --
>
> Key: SPARK-28150
> URL: https://issues.apache.org/jira/browse/SPARK-28150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Take the following small app that creates multiple contexts (not 
> concurrently):
> {code}
> from pyspark.context import SparkContext
> import time
> for i in range(2):
>   with SparkContext() as sc:
> pass
>   time.sleep(5)
> {code}
> This fails when kerberos (without dt renewal) is being used:
> {noformat}
> 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
> Caused by: 
> org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error 
> calling method hbase.pb.AuthenticationService.GetAuthenticationToken
> at 
> org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException):
>  org.apache.hadoop.hbase.security.AccessDeniedExcep

[jira] [Created] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth

2019-06-24 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-28150:
--

 Summary: Failure to create multiple contexts in same JVM with 
Kerberos auth
 Key: SPARK-28150
 URL: https://issues.apache.org/jira/browse/SPARK-28150
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


Take the following small app that creates multiple contexts (not concurrently):

{code}
from pyspark.context import SparkContext
import time

for i in range(2):
  with SparkContext() as sc:
pass
  time.sleep(5)
{code}

This fails when kerberos (without dt renewal) is being used:

{noformat}
19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
Caused by: org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: 
Error calling method hbase.pb.AuthenticationService.GetAuthenticationToken
at 
org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
{noformat}

If you enable dt renewal things work since the codes takes a slightly different 
path when generating the initial delegation tokens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile

2019-06-24 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-28114.
-
Resolution: Fixed

going to mark this as resolved.  builds are green (WRT hadoop-3.2).

> Add Jenkins job for `Hadoop-3.2` profile
> 
>
> Key: SPARK-28114
> URL: https://issues.apache.org/jira/browse/SPARK-28114
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
>
> Spark 3.0 is a major version change. We want to have the following new Jobs.
> 1. SBT with hadoop-3.2
> 2. Maven with hadoop-3.2 (on JDK8 and JDK11)
> Also, shall we have a limit for the concurrent run for the following existing 
> job? Currently, it invokes multiple jobs concurrently. We can save the 
> resource by limiting to 1 like the other jobs.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing
> We will drop four `branch-2.3` jobs at the end of August, 2019.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT

2019-06-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-28003.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24844
[https://github.com/apache/spark/pull/24844]

> spark.createDataFrame with Arrow doesn't work with pandas.NaT 
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> import pandas as pd
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with arrow enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT

2019-06-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-28003:


Assignee: Li Jin

> spark.createDataFrame with Arrow doesn't work with pandas.NaT 
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
>
> {code:java}
> import pandas as pd
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with arrow enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Description: 
Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job 
are not propagated to Python. This is because toLocalIterator() and toPandas() 
with Arrow enabled run Spark jobs asynchronously in a background thread, after 
creating the socket connection info. The fix for these was to catch a 
SparkException if the job errored and then send the exception through the 
pyspark serializer.

A better fix would be to allow Python to await on the serving thread future and 
join the thread. That way if the serving thread throws an exception, it will be 
propagated on the call to awaitResult.

  was:
Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job 
are not propagated to Python. This is because toLocalIterator() and toPandas() 
with Arrow enabled run Spark jobs asynchronously in a background thread, after 
creating the socket connection info. The fix for these was to catch a 
SparkException if the job errored and then send the exception through the 
pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way if the serving thread throws an exception, it will be 
propagated on the synchronization call.


> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to await on the serving thread future 
> and join the thread. That way if the serving thread throws an exception, it 
> will be propagated on the call to awaitResult.
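As an illustration of the proposed direction, a plain-Scala sketch of the 
pattern (not Spark internals): run the serving work in a background future and 
have the caller block on it afterwards, so any serving-side exception is 
rethrown to the caller:

{code}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

// The `serve` callback stands in for writing partitions to the local socket.
def serveAndAwait(serve: () => Unit): Unit = {
  val serving: Future[Unit] = Future { serve() }
  // ... hand the connection info to the client and let it consume the data ...
  Await.result(serving, Duration.Inf) // rethrows the serving thread's exception
}
{code}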



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28149) Disable negeative DNS caching

2019-06-24 Thread Jose Luis Pedrosa (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Luis Pedrosa updated SPARK-28149:
--
Description: 
By default JVM caches the failures for the DNS resolutions, by default is 
cached by 10 seconds.

Alpine JDK used in the images for kubernetes has a default timout of 5 seconds.

This means that in clusters with slow init time (network sidecar pods, slow 
network start up) executor will never run, because the first attempt to connect 
to the driver will fail, and that failure will be cached, causing  the retries 
to happen in a tight loop without actually trying again.

 

The proposed implementation would be to add to the entrypoint.sh (that is 
exclusive for k8s) to alter the file with the dns caching, and disable it if 
there's an environment variable as "DISABLE_DNS_NEGATIVE_CACHING" defined. 

 

  was:
By default JVM caches the failures for the DNS resolutions, by default is 
cached by 10 seconds.

Alpine JDK used in the images for kubernetes has a default timout of 5 seconds.

This means that in clusters with slow init time (network sidecar pods, slow 
network start up) executor will never run, because the first attempt to connect 
to the driver will fail, and that failure will be cached, causing  the retries 
to happen in a tight loop without actually trying again.

 

The proposed implementation would be to add to the entrypoint.sh (that is 
exclusive for k8s) to alter the file with the dns caching, and disable it if 
there's an environment variable as "DISABLE_DNS_NEGATIVE_CAHING" defined. 

 


> Disable negeative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> By default the JVM caches DNS resolution failures; by default they are cached 
> for 10 seconds.
> The Alpine JDK used in the Kubernetes images has a default timeout of 5 
> seconds.
> This means that in clusters with slow init times (network sidecar pods, slow 
> network start-up) the executor will never run, because the first attempt to 
> connect to the driver will fail and that failure will be cached, causing the 
> retries to happen in a tight loop without actually trying again.
>  
> The proposed implementation would be to have entrypoint.sh (which is exclusive 
> to k8s) alter the file that controls DNS caching and disable negative caching 
> if an environment variable such as "DISABLE_DNS_NEGATIVE_CACHING" is defined. 
>  
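For reference, a sketch of the JVM setting involved, assuming the standard 
{{java.security}} property that controls negative DNS caching (the exact 
location of the security properties file varies by JDK version and image):

{noformat}
# line appended/overridden by entrypoint.sh when DISABLE_DNS_NEGATIVE_CACHING is set
networkaddress.cache.negative.ttl=0
{noformat}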



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28149) Disable negeative DNS caching

2019-06-24 Thread Jose Luis Pedrosa (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Luis Pedrosa updated SPARK-28149:
--
Description: 
By default JVM caches the failures for the DNS resolutions, by default is 
cached by 10 seconds.

Alpine JDK used in the images for kubernetes has a default timout of 5 seconds.

This means that in clusters with slow init time (network sidecar pods, slow 
network start up) executor will never run, because the first attempt to connect 
to the driver will fail, and that failure will be cached, causing  the retries 
to happen in a tight loop without actually trying again.

 

The proposed implementation would be to add to the entrypoint.sh (that is 
exclusive for k8s) to alter the file with the dns caching, and disable it if 
there's an environment variable as "DISABLE_DNS_NEGATIVE_CAHING" defined. 

 

  was:
By default JVM caches the failures for the DNS resolutions, by default is 
cached by 10 seconds.

Alpine JDK used in the images for kubernetes has a default timout of 5 seconds.

This means that in clusters with slow init time (network sidecar pods, slow 
network start up) executor will never run, because the first attempt to connect 
to the driver will fail, and that failure will be cached, causing  the retries 
to happen in a tight loop without actually trying again.

 

 


> Disable negeative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> By default JVM caches the failures for the DNS resolutions, by default is 
> cached by 10 seconds.
> Alpine JDK used in the images for kubernetes has a default timout of 5 
> seconds.
> This means that in clusters with slow init time (network sidecar pods, slow 
> network start up) executor will never run, because the first attempt to 
> connect to the driver will fail, and that failure will be cached, causing  
> the retries to happen in a tight loop without actually trying again.
>  
> The proposed implementation would be to add to the entrypoint.sh (that is 
> exclusive for k8s) to alter the file with the dns caching, and disable it if 
> there's an environment variable as "DISABLE_DNS_NEGATIVE_CAHING" defined. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28149) Disable negeative DNS caching

2019-06-24 Thread Jose Luis Pedrosa (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Luis Pedrosa updated SPARK-28149:
--
Summary: Disable negeative DNS caching  (was: Disable negeative DNS 
caching.)

> Disable negeative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> By default JVM caches the failures for the DNS resolutions, by default is 
> cached by 10 seconds.
> Alpine JDK used in the images for kubernetes has a default timout of 5 
> seconds.
> This means that in clusters with slow init time (network sidecar pods, slow 
> network start up) executor will never run, because the first attempt to 
> connect to the driver will fail, and that failure will be cached, causing  
> the retries to happen in a tight loop without actually trying again.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28149) Disable negeative DNS caching.

2019-06-24 Thread Jose Luis Pedrosa (JIRA)
Jose Luis Pedrosa created SPARK-28149:
-

 Summary: Disable negeative DNS caching.
 Key: SPARK-28149
 URL: https://issues.apache.org/jira/browse/SPARK-28149
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.3
Reporter: Jose Luis Pedrosa


By default JVM caches the failures for the DNS resolutions, by default is 
cached by 10 seconds.

Alpine JDK used in the images for kubernetes has a default timout of 5 seconds.

This means that in clusters with slow init time (network sidecar pods, slow 
network start up) executor will never run, because the first attempt to connect 
to the driver will fail, and that failure will be cached, causing  the retries 
to happen in a tight loop without actually trying again.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28148) repartition after join is not optimized away

2019-06-24 Thread colin fang (JIRA)
colin fang created SPARK-28148:
--

 Summary: repartition after join is not optimized away
 Key: SPARK-28148
 URL: https://issues.apache.org/jira/browse/SPARK-28148
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: colin fang


Partitioning & sorting is usually retained after join.

{code}
spark.conf.set('spark.sql.shuffle.partitions', '42')

df1 = spark.range(500, numPartitions=5)
df2 = spark.range(1000, numPartitions=5)
df3 = spark.range(2000, numPartitions=5)

# Reuse previous partitions & sort.
df1.join(df2, on='id').join(df3, on='id').explain()
# == Physical Plan ==
# *(8) Project [id#367L]
# +- *(8) SortMergeJoin [id#367L], [id#374L], Inner
#:- *(5) Project [id#367L]
#:  +- *(5) SortMergeJoin [id#367L], [id#369L], Inner
#: :- *(2) Sort [id#367L ASC NULLS FIRST], false, 0
#: :  +- Exchange hashpartitioning(id#367L, 42)
#: : +- *(1) Range (0, 500, step=1, splits=5)
#: +- *(4) Sort [id#369L ASC NULLS FIRST], false, 0
#:+- Exchange hashpartitioning(id#369L, 42)
#:   +- *(3) Range (0, 1000, step=1, splits=5)
#+- *(7) Sort [id#374L ASC NULLS FIRST], false, 0
#   +- Exchange hashpartitioning(id#374L, 42)
#  +- *(6) Range (0, 2000, step=1, splits=5)

{code}

However here:  Partitions persist through left join, sort doesn't.

{code}
df1.join(df2, on='id', 
how='left').repartition('id').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(5) Sort [id#367L ASC NULLS FIRST], false, 0
# +- *(5) Project [id#367L]
#+- SortMergeJoin [id#367L], [id#369L], LeftOuter
#   :- *(2) Sort [id#367L ASC NULLS FIRST], false, 0
#   :  +- Exchange hashpartitioning(id#367L, 42)
#   : +- *(1) Range (0, 500, step=1, splits=5)
#   +- *(4) Sort [id#369L ASC NULLS FIRST], false, 0
#  +- Exchange hashpartitioning(id#369L, 42)
# +- *(3) Range (0, 1000, step=1, splits=5)
{code}

Also here: Partitions do not persist through inner join.


{code}
df1.join(df2, on='id').repartition('id').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(6) Sort [id#367L ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(id#367L, 42)
#+- *(5) Project [id#367L]
#   +- *(5) SortMergeJoin [id#367L], [id#369L], Inner
#  :- *(2) Sort [id#367L ASC NULLS FIRST], false, 0
#  :  +- Exchange hashpartitioning(id#367L, 42)
#  : +- *(1) Range (0, 500, step=1, splits=5)
#  +- *(4) Sort [id#369L ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(id#369L, 42)
#+- *(3) Range (0, 1000, step=1, splits=5)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28146) Support IS OF () predicate

2019-06-24 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871474#comment-16871474
 ] 

Yuming Wang commented on SPARK-28146:
-

{{IS OF}} does not conform to the ISO SQL behavior, so it is undocumented:
https://github.com/postgres/postgres/blob/REL_12_BETA1/doc/src/sgml/func.sgml#L518-L535

> Support IS OF () predicate
> 
>
> Key: SPARK-28146
> URL: https://issues.apache.org/jira/browse/SPARK-28146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Minor
>
> PostgreSQL supports IS OF () predicate, for example the following query 
> is valid:
> {noformat}
> select 1 is of (int), true is of (bool)
> true true
> {noformat}
> I can't find PostgreSQL documentation about it, but here is how it works in 
> Oracle:
>  
> [https://docs.oracle.com/cd/B28359_01/server.111/b28286/conditions014.htm#SQLRF52157]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28147) Support RETURNING clause

2019-06-24 Thread Peter Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-28147:
---
Description: 
PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to 
return data from the modified rows.

[https://www.postgresql.org/docs/9.5/dml-returning.html]

  was:
PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to 
return the modified rows.

[https://www.postgresql.org/docs/9.5/dml-returning.html]


> Support RETURNING clause
> ---
>
> Key: SPARK-28147
> URL: https://issues.apache.org/jira/browse/SPARK-28147
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Major
>
> PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to 
> return data from the modified rows.
> [https://www.postgresql.org/docs/9.5/dml-returning.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28147) Support RETURNING clause

2019-06-24 Thread Peter Toth (JIRA)
Peter Toth created SPARK-28147:
--

 Summary: Support RETURNING clause
 Key: SPARK-28147
 URL: https://issues.apache.org/jira/browse/SPARK-28147
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Peter Toth


PostgreSQL supports the RETURNING clause on INSERT/UPDATE/DELETE statements to 
return the modified rows.

[https://www.postgresql.org/docs/9.5/dml-returning.html]
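
For illustration only (not from the ticket): without RETURNING, getting the affected 
rows back in Spark SQL today takes a separate SELECT after the write. A minimal 
PySpark sketch, assuming an active spark session; the events table is hypothetical.

{code}
# Hypothetical table, used only to make the missing RETURNING clause concrete.
spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, kind STRING) USING parquet")
spark.sql("INSERT INTO events VALUES (1, 'click')")

# PostgreSQL would return this row directly from INSERT ... RETURNING id, kind;
# without the clause, a follow-up query is needed.
spark.sql("SELECT id, kind FROM events WHERE id = 1").show()
{code}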



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28146) Support IS OF () predicate

2019-06-24 Thread Peter Toth (JIRA)
Peter Toth created SPARK-28146:
--

 Summary: Support IS OF () predicate
 Key: SPARK-28146
 URL: https://issues.apache.org/jira/browse/SPARK-28146
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Peter Toth


PostgreSQL supports the IS OF () predicate; for example, the following query 
is valid:
{noformat}
select 1 is of (int), true is of (bool)

true true
{noformat}
I can't find PostgreSQL documentation about it, but here is how it works in 
Oracle:
 
[https://docs.oracle.com/cd/B28359_01/server.111/b28286/conditions014.htm#SQLRF52157]
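
For comparison, a minimal PySpark sketch of what is available today (assuming an 
active spark session): there is no row-level type predicate, but the analyzed 
schema exposes the same type information for the two example expressions.

{code}
# Sketch only: inspect the analyzed types instead of an IS OF () predicate.
df = spark.sql("SELECT 1 AS a, true AS b")
types = {f.name: f.dataType.simpleString() for f in df.schema.fields}
print(types['a'] == 'int', types['b'] == 'boolean')  # True True
{code}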



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28145) Executor pods polling source can fail to replace dead executors

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28145:


Assignee: (was: Apache Spark)

> Executor pods polling source can fail to replace dead executors
> ---
>
> Key: SPARK-28145
> URL: https://issues.apache.org/jira/browse/SPARK-28145
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Onur Satici
>Priority: Major
>
> The scheduled task responsible for reporting executor snapshots to the executor 
> allocator in Kubernetes will die on any error, killing subsequent runs of the 
> same task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28145) Executor pods polling source can fail to replace dead executors

2019-06-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28145:


Assignee: Apache Spark

> Executor pods polling source can fail to replace dead executors
> ---
>
> Key: SPARK-28145
> URL: https://issues.apache.org/jira/browse/SPARK-28145
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Onur Satici
>Assignee: Apache Spark
>Priority: Major
>
> The scheduled task responsible for reporting executor snapshots to the executor 
> allocator in Kubernetes will die on any error, killing subsequent runs of the 
> same task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28145) Executor pods polling source can fail to replace dead executors

2019-06-24 Thread Onur Satici (JIRA)
Onur Satici created SPARK-28145:
---

 Summary: Executor pods polling source can fail to replace dead 
executors
 Key: SPARK-28145
 URL: https://issues.apache.org/jira/browse/SPARK-28145
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.4.3, 3.0.0
Reporter: Onur Satici


The scheduled task responsible for reporting executor snapshots to the executor 
allocator in Kubernetes will die on any error, killing subsequent runs of the 
same task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27018:
-

Assignee: zhengruifeng

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or the Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Assignee: zhengruifeng
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd

[jira] [Resolved] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27018.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   2.3.4
   3.0.0

Issue resolved by pull request 24870
[https://github.com/apache/spark/pull/24870]

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or the Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)

[jira] [Assigned] (SPARK-27989) Add retries on the connection to the driver

2019-06-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27989:
-

Assignee: Jose Luis Pedrosa

> Add retries on the connection to the driver
> ---
>
> Key: SPARK-27989
> URL: https://issues.apache.org/jira/browse/SPARK-27989
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Assignee: Jose Luis Pedrosa
>Priority: Minor
>
>  
> Any failure in the executor when trying to connect to the driver makes a 
> connection from that process impossible, which triggers the scheduling of a 
> replacement executor.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27989) Add retries on the connection to the driver

2019-06-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27989.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24702
[https://github.com/apache/spark/pull/24702]

> Add retries on the connection to the driver
> ---
>
> Key: SPARK-27989
> URL: https://issues.apache.org/jira/browse/SPARK-27989
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Assignee: Jose Luis Pedrosa
>Priority: Minor
> Fix For: 3.0.0
>
>
>  
> Any failure in the executor when trying to connect to the driver makes a 
> connection from that process impossible, which triggers the scheduling of a 
> replacement executor.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone

2019-06-24 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871213#comment-16871213
 ] 

Jiatao Tao commented on SPARK-27546:


Hello [~dongjoon]

Anyone ?

> Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
> -
>
> Key: SPARK-27546
> URL: https://issues.apache.org/jira/browse/SPARK-27546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiatao Tao
>Priority: Minor
> Attachments: image-2019-04-23-08-10-00-475.png, 
> image-2019-04-23-08-10-50-247.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28142) KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT

2019-06-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28142.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24942
[https://github.com/apache/spark/pull/24942]

> KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT
> -
>
> Key: SPARK-28142
> URL: https://issues.apache.org/jira/browse/SPARK-28142
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> KafkaContinuousStream has a bug where {{pollTimeoutMs}} is always set to the 
> default value: the option key {{KafkaSourceProvider.CONSUMER_POLL_TIMEOUT}} is 
> {{kafkaConsumer.pollTimeoutMs}}, but the map provided as {{sourceOptions}} has 
> lowercased keys, so the lookup never matches.
> This is a missed spot: the map should have been passed as a 
> CaseInsensitiveStringMap, as was done in SPARK-27106.
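
A tiny illustration of the pitfall described above, in plain Python rather than the 
actual Spark/Kafka code path: once the option keys are lowercased, a camel-case 
lookup silently misses and the default wins.

{code}
# The option as a user would set it, and the lowercased map the source received.
user_options = {"kafkaConsumer.pollTimeoutMs": "512"}
source_options = {k.lower(): v for k, v in user_options.items()}

print(source_options.get("kafkaConsumer.pollTimeoutMs"))          # None -> default used
print(source_options.get("kafkaConsumer.pollTimeoutMs".lower()))  # '512'
{code}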



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28142) KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT

2019-06-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28142:


Assignee: Jungtaek Lim

> KafkaContinuousStream ignores user configuration on CONSUMER_POLL_TIMEOUT
> -
>
> Key: SPARK-28142
> URL: https://issues.apache.org/jira/browse/SPARK-28142
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> KafkaContinuousStream has a bug where {{pollTimeoutMs}} is always set to the 
> default value: the option key {{KafkaSourceProvider.CONSUMER_POLL_TIMEOUT}} is 
> {{kafkaConsumer.pollTimeoutMs}}, but the map provided as {{sourceOptions}} has 
> lowercased keys, so the lookup never matches.
> This is a missed spot: the map should have been passed as a 
> CaseInsensitiveStringMap, as was done in SPARK-27106.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6107) event log file ends with .inprogress should be able to display on webUI for standalone mode

2019-06-24 Thread leo.zhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871005#comment-16871005
 ] 

leo.zhi edited comment on SPARK-6107 at 6/24/19 9:27 AM:
-

In version 2.2.0-cdh6.0.1 on a YARN cluster, it still happens. :(


was (Author: leo.zhi):
in version 2.2.0-cdh6.0.1, it still happens.:(

> event log file ends with .inprogress should be able to display on webUI for 
> standalone mode
> ---
>
> Key: SPARK-6107
> URL: https://issues.apache.org/jira/browse/SPARK-6107
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Major
> Fix For: 1.4.0
>
>
> When an application finishes abnormally (Ctrl+C, for example), the history 
> event log file still ends with the *.inprogress* suffix, and the application 
> state cannot be shown on the web UI; the user just sees "*Application 
> history not found, Application xxx is still in progress*".
> Users should also be able to see the status of abnormally finished applications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6107) event log file ends with .inprogress should be able to display on webUI for standalone mode

2019-06-24 Thread leo.zhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871005#comment-16871005
 ] 

leo.zhi commented on SPARK-6107:


In version 2.2.0-cdh6.0.1, it still happens. :(

> event log file ends with .inprogress should be able to display on webUI for 
> standalone mode
> ---
>
> Key: SPARK-6107
> URL: https://issues.apache.org/jira/browse/SPARK-6107
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Major
> Fix For: 1.4.0
>
>
> When an application finishes abnormally (Ctrl+C, for example), the history 
> event log file still ends with the *.inprogress* suffix, and the application 
> state cannot be shown on the web UI; the user just sees "*Application 
> history not found, Application xxx is still in progress*".
> Users should also be able to see the status of abnormally finished applications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28144) Remove ZKUtils from Kafka tests

2019-06-24 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870961#comment-16870961
 ] 

Gabor Somogyi commented on SPARK-28144:
---

ZookeeperClient has {{private[kafka]}} visibility, so it could be used via 
reflection, but I think adding another hacky solution that may not reflect how 
Kafka itself addresses this issue is a bad idea in general.

> Remove ZKUtils from Kafka tests
> ---
>
> Key: SPARK-28144
> URL: https://issues.apache.org/jira/browse/SPARK-28144
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> ZKUtils has been deprecated since Kafka 2.0.0, so it would be good to replace it.
> I've taken a look at the possibilities, but it seems there is no working 
> alternative at the moment. Please see KAFKA-8468.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28144) Remove ZKUtils from Kafka tests

2019-06-24 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870959#comment-16870959
 ] 

Gabor Somogyi commented on SPARK-28144:
---

I've made the implementation, but it won't work until the mentioned bug is 
fixed.

> Remove ZKUtils from Kafka tests
> ---
>
> Key: SPARK-28144
> URL: https://issues.apache.org/jira/browse/SPARK-28144
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> ZKUtils has been deprecated since Kafka 2.0.0, so it would be good to replace it.
> I've taken a look at the possibilities, but it seems there is no working 
> alternative at the moment. Please see KAFKA-8468.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28144) Remove ZKUtils from Kafka tests

2019-06-24 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-28144:
-

 Summary: Remove ZKUtils from Kafka tests
 Key: SPARK-28144
 URL: https://issues.apache.org/jira/browse/SPARK-28144
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming, Tests
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


ZKUtils has been deprecated since Kafka 2.0.0, so it would be good to replace it.
I've taken a look at the possibilities, but it seems there is no working 
alternative at the moment. Please see KAFKA-8468.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-06-24 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870880#comment-16870880
 ] 

Marco Gaido commented on SPARK-28067:
-

No, it is the same. Are you sure about your configs?

{code}
macmarco:spark mark9$ git log -5 --oneline
5ad1053f3e (HEAD, apache/master) [SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs 
skip empty partitions
113f8c8d13 [SPARK-28132][PYTHON] Update document type conversion for Pandas 
UDFs (pyarrow 0.13.0, pandas 0.24.2, Python 3.7)
9b9d81b821 [SPARK-28131][PYTHON] Update document type conversion between Python 
data and SQL types in normal UDFs (Python 3.7)
54da3bbfb2 [SPARK-28127][SQL] Micro optimization on TreeNode's mapChildren 
method
47f54b1ec7 [SPARK-28118][CORE] Add `spark.eventLog.compression.codec` 
configuration
macmarco:spark mark9$ ./bin/spark-shell 
19/06/24 09:17:58 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://.:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1561360686725).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/
 
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
"intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(decNum#14)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(decNum#14)])
      +- *(1) Project [decNum#14]
         +- *(1) BroadcastHashJoin [intNum#8], [intNum#15], Inner, BuildLeft
            :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
            :  +- LocalTableScan [intNum#8]
            +- LocalTableScan [decNum#14, intNum#15]



scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null       |
+-----------+
{code}
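
A quick way to isolate the codegen factor named in the title (a sketch only, shown 
in PySpark; df2 is assumed to be built exactly as in the report, and 
spark.sql.codegen.wholeStage is the standard whole-stage codegen switch):

{code}
# Rerun the same aggregation with whole-stage codegen disabled and compare.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
df2.show(40, False)  # result of the sum without whole-stage codegen
spark.conf.set("spark.sql.codegen.wholeStage", "true")
{code}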

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>Reporter: Mark Sirek
>Priority: Minor
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 1040