[jira] [Assigned] (SPARK-30700) NaiveBayesModel predict optimization

2020-01-31 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30700:


Assignee: zhengruifeng

> NaiveBayesModel predict optimization
> 
>
> Key: SPARK-30700
> URL: https://issues.apache.org/jira/browse/SPARK-30700
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> The variable negThetaSum is always used together with pi, so we can add them together once up front.






[jira] [Resolved] (SPARK-30700) NaiveBayesModel predict optimization

2020-01-31 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30700.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27427
[https://github.com/apache/spark/pull/27427]

> NaiveBayesModel predict optimization
> 
>
> Key: SPARK-30700
> URL: https://issues.apache.org/jira/browse/SPARK-30700
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 3.0.0
>
>
> The variable negThetaSum is always used together with pi, so we can add them together once up front.






[jira] [Resolved] (SPARK-29138) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy

2020-01-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29138.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27424
[https://github.com/apache/spark/pull/27424]

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
> -
>
> Key: SPARK-29138
> URL: https://issues.apache.org/jira/browse/SPARK-29138
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110686/testReport/]
> {code:java}
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 266, in test_parameter_accuracy
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 263, in condition
> self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.17619737864096185 != 0.1 within 1 places {code}






[jira] [Assigned] (SPARK-29138) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy

2020-01-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29138:


Assignee: Dongjoon Hyun

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
> -
>
> Key: SPARK-29138
> URL: https://issues.apache.org/jira/browse/SPARK-29138
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Dongjoon Hyun
>Priority: Major
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110686/testReport/]
> {code:java}
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 266, in test_parameter_accuracy
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 263, in condition
> self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.17619737864096185 != 0.1 within 1 places {code}






[jira] [Created] (SPARK-30700) NaiveBayesModel predict optimization

2020-01-31 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30700:


 Summary: NaiveBayesModel predict optimization
 Key: SPARK-30700
 URL: https://issues.apache.org/jira/browse/SPARK-30700
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


The variable negThetaSum is always used together with pi, so we can add them together once up front.
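
A minimal Scala sketch of the idea, independent of the actual Spark ML internals (the array names and values below are illustrative only): fold the constant pi + negThetaSum together once when the model is built, instead of re-adding both terms for every prediction.

{code:scala}
object NegThetaSumSketch {
  // Illustrative per-class constants (hypothetical values, not real model parameters).
  val logPi: Array[Double] = Array(-0.69, -0.69)        // class log-priors
  val negThetaSum: Array[Double] = Array(-3.10, -2.80)  // per-class sum of log(1 - theta)

  // Before: both constants are added for every prediction.
  def marginsOld(dot: Array[Double]): Array[Double] =
    Array.tabulate(logPi.length)(k => logPi(k) + negThetaSum(k) + dot(k))

  // After: precompute pi + negThetaSum once and reuse it for every prediction.
  val piPlusNegThetaSum: Array[Double] =
    Array.tabulate(logPi.length)(k => logPi(k) + negThetaSum(k))

  def marginsNew(dot: Array[Double]): Array[Double] =
    Array.tabulate(logPi.length)(k => piPlusNegThetaSum(k) + dot(k))

  def main(args: Array[String]): Unit = {
    val dot = Array(1.5, 2.5)  // hypothetical per-class dot products
    assert(marginsOld(dot).sameElements(marginsNew(dot)))
  }
}
{code}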






[jira] [Created] (SPARK-30699) GMM blockify input vectors

2020-01-31 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30699:


 Summary: GMM blockify input vectors
 Key: SPARK-30699
 URL: https://issues.apache.org/jira/browse/SPARK-30699
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: zhengruifeng
Assignee: zhengruifeng









[jira] [Comment Edited] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2020-01-31 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027975#comment-17027975
 ] 

Maxim Gekk edited comment on SPARK-30696 at 2/1/20 5:28 AM:


[~dongjoon] We have different default time zones; maybe it depends on that. For 
example, the code there uses the system time zone: 
[https://github.com/apache/spark/blob/de21f28f8a0a41dd7eb8ed1ff8b35a6d7538958b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L806-L807]


was (Author: maxgekk):
[~dongjoon] We have different default time zones, maybe it depends on this.

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}






[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2020-01-31 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027975#comment-17027975
 ] 

Maxim Gekk commented on SPARK-30696:


[~dongjoon] We have different default time zones; maybe it depends on that.

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}






[jira] [Commented] (SPARK-28344) fail the query if detect ambiguous self join

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027973#comment-17027973
 ] 

Dongjoon Hyun commented on SPARK-28344:
---

Please refer to the discussion on our backporting efforts:
- https://github.com/apache/spark/pull/27417

I removed `Target Version: 2.4.5` for now. If we need this in `branch-2.4` 
later, the `Target Version` will be `2.4.6`.

> fail the query if detect ambiguous self join
> 
>
> Key: SPARK-28344
> URL: https://issues.apache.org/jira/browse/SPARK-28344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-28344) fail the query if detect ambiguous self join

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28344:
--
Target Version/s: 3.0.0  (was: 2.4.5, 3.0.0)

> fail the query if detect ambiguous self join
> 
>
> Key: SPARK-28344
> URL: https://issues.apache.org/jira/browse/SPARK-28344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-30698) Bumps checkstyle from 8.25 to 8.29.

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30698:
-

Assignee: jiaan.geng

> Bumps checkstyle from 8.25 to 8.29.
> ---
>
> Key: SPARK-30698
> URL: https://issues.apache.org/jira/browse/SPARK-30698
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> I found that checkstyle has a new release: 
> [https://checkstyle.org/releasenotes.html#Release_8.29]
> This bumps checkstyle from 8.25 to 8.29.






[jira] [Resolved] (SPARK-30698) Bumps checkstyle from 8.25 to 8.29.

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30698.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27426
[https://github.com/apache/spark/pull/27426]

> Bumps checkstyle from 8.25 to 8.29.
> ---
>
> Key: SPARK-30698
> URL: https://issues.apache.org/jira/browse/SPARK-30698
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> I found that checkstyle has a new release: 
> [https://checkstyle.org/releasenotes.html#Release_8.29]
> This bumps checkstyle from 8.25 to 8.29.






[jira] [Resolved] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling

2020-01-31 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-30689.
---
Fix Version/s: 3.0.0
 Assignee: Thomas Graves
   Resolution: Fixed

> Allow custom resource scheduling to work with YARN versions that don't 
> support custom resource scheduling
> -
>
> Key: SPARK-30689
> URL: https://issues.apache.org/jira/browse/SPARK-30689
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Many people/companies will not be moving to Hadoop 3.1 or greater, which 
> supports custom resource scheduling for things like GPUs, any time soon, and 
> they have requested support for it in older Hadoop 2.x versions. This also 
> means that they may not have isolation enabled, which is what the default 
> behavior relies on.
> Right now the only option is to write a custom discovery script and handle it 
> on their own. This is OK but has some limitations, because the script runs as 
> a separate process and is just a shell script.
> I think we can make this a lot more flexible by making the entire resource 
> discovery class pluggable. The default one would stay as is and call the 
> discovery script, but an advanced user who wanted to replace the entire thing 
> could implement a pluggable class containing custom code for how to discover 
> resource addresses.
> This will also help users who are running Hadoop 3.1.x or greater but don't 
> have the resources configured or aren't running in an isolated environment.
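
A rough Scala sketch of what such a pluggable discovery class could look like. This is illustrative only and is not Spark's actual plugin API; the trait and class names below are hypothetical.

{code:scala}
import scala.sys.process._

// Hypothetical interface -- not Spark's real API -- sketching a pluggable
// resource discovery class.
trait ResourceDiscovery {
  /** Returns the addresses (e.g. GPU indices) found for the given resource name. */
  def discover(resourceName: String): Seq[String]
}

// Default behavior: shell out to a user-provided discovery script, as today.
class ScriptResourceDiscovery(scriptPath: String) extends ResourceDiscovery {
  override def discover(resourceName: String): Seq[String] =
    Seq(scriptPath, resourceName).!!.trim.split(",").map(_.trim).toSeq
}

// An advanced user could instead plug in custom JVM code, e.g. a static mapping
// read from their own configuration.
class StaticResourceDiscovery(addresses: Map[String, Seq[String]]) extends ResourceDiscovery {
  override def discover(resourceName: String): Seq[String] =
    addresses.getOrElse(resourceName, Seq.empty)
}
{code}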






[jira] [Updated] (SPARK-27946) Hive DDL to Spark DDL conversion USING "show create table"

2020-01-31 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27946:

Description: 
This patch adds a DDL command, SHOW CREATE TABLE AS SERDE, which is used to 
generate Hive DDL for a Hive table.

The original SHOW CREATE TABLE now always shows Spark DDL; if given a Hive 
table, it tries to generate Spark DDL.

For the Hive serde to data source conversion, it uses the existing mapping 
inside HiveSerDe. If it can't find a mapping there, it throws an analysis 
exception for the unsupported serde configuration.

Arguably, some Hive file format + row serde combinations could also be mapped 
to a Spark data source, e.g. CSV. That is not included in this PR and, to be 
conservative, may not be supported.

Hive serde properties are not saved to the Spark DDL for now, because keeping 
Hive serde properties in a Spark table may not be useful.

  was:Many users migrate tables created with Hive DDL to Spark. Defining the 
table with Spark DDL brings performance benefits. We need to add a feature to 
Show Create Table that allows you to generate Spark DDL for a table. For 
example: `SHOW CREATE TABLE customers AS SPARK`.


> Hive DDL to Spark DDL conversion USING "show create table"
> --
>
> Key: SPARK-27946
> URL: https://issues.apache.org/jira/browse/SPARK-27946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> This patch adds a DDL command, SHOW CREATE TABLE AS SERDE, which is used to 
> generate Hive DDL for a Hive table.
> The original SHOW CREATE TABLE now always shows Spark DDL; if given a Hive 
> table, it tries to generate Spark DDL.
> For the Hive serde to data source conversion, it uses the existing mapping 
> inside HiveSerDe. If it can't find a mapping there, it throws an analysis 
> exception for the unsupported serde configuration.
> Arguably, some Hive file format + row serde combinations could also be mapped 
> to a Spark data source, e.g. CSV. That is not included in this PR and, to be 
> conservative, may not be supported.
> Hive serde properties are not saved to the Spark DDL for now, because keeping 
> Hive serde properties in a Spark table may not be useful.
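
A minimal usage sketch of the behavior described above, assuming a spark-shell session with Hive support enabled; the table name `t` is illustrative.

{code:scala}
// Create a Hive-format table, then compare the two outputs.
spark.sql("CREATE TABLE t (id INT) STORED AS PARQUET")

// SHOW CREATE TABLE now always prints Spark DDL, even for a Hive table.
spark.sql("SHOW CREATE TABLE t").show(truncate = false)

// SHOW CREATE TABLE ... AS SERDE prints Hive DDL for the Hive table.
spark.sql("SHOW CREATE TABLE t AS SERDE").show(truncate = false)
{code}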






[jira] [Resolved] (SPARK-27946) Hive DDL to Spark DDL conversion USING "show create table"

2020-01-31 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27946.
-
Fix Version/s: 3.0.0
 Assignee: L. C. Hsieh
   Resolution: Fixed

> Hive DDL to Spark DDL conversion USING "show create table"
> --
>
> Key: SPARK-27946
> URL: https://issues.apache.org/jira/browse/SPARK-27946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> This patch adds a DDL command, SHOW CREATE TABLE AS SERDE, which is used to 
> generate Hive DDL for a Hive table.
> The original SHOW CREATE TABLE now always shows Spark DDL; if given a Hive 
> table, it tries to generate Spark DDL.
> For the Hive serde to data source conversion, it uses the existing mapping 
> inside HiveSerDe. If it can't find a mapping there, it throws an analysis 
> exception for the unsupported serde configuration.
> Arguably, some Hive file format + row serde combinations could also be mapped 
> to a Spark data source, e.g. CSV. That is not included in this PR and, to be 
> conservative, may not be supported.
> Hive serde properties are not saved to the Spark DDL for now, because keeping 
> Hive serde properties in a Spark table may not be useful.






[jira] [Resolved] (SPARK-30660) LinearRegression blockify input vectors

2020-01-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30660.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27396
[https://github.com/apache/spark/pull/27396]

> LinearRegression blockify input vectors
> ---
>
> Key: SPARK-30660
> URL: https://issues.apache.org/jira/browse/SPARK-30660
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-30660) LinearRegression blockify input vectors

2020-01-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30660:


Assignee: zhengruifeng

> LinearRegression blockify input vectors
> ---
>
> Key: SPARK-30660
> URL: https://issues.apache.org/jira/browse/SPARK-30660
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>







[jira] [Created] (SPARK-30698) Bumps checkstyle from 8.25 to 8.29.

2020-01-31 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30698:
--

 Summary: Bumps checkstyle from 8.25 to 8.29.
 Key: SPARK-30698
 URL: https://issues.apache.org/jira/browse/SPARK-30698
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.0.0
Reporter: jiaan.geng


I found that checkstyle has a new release: 
[https://checkstyle.org/releasenotes.html#Release_8.29]

This bumps checkstyle from 8.25 to 8.29.






[jira] [Commented] (SPARK-29138) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027930#comment-17027930
 ] 

Dongjoon Hyun commented on SPARK-29138:
---

I saw this in three places (two independent PRs and the `Hadoop 2.7+Hive1.2` 
Jenkins job):
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/lastCompletedBuild/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_parameter_accuracy/

cc [~tgraves]. This is the one we are hitting today. When I try this in a JDK11 
environment, it is not reproduced, so the flakiness is not related to the JDK 
version.

https://github.com/apache/spark/pull/27424

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
> -
>
> Key: SPARK-29138
> URL: https://issues.apache.org/jira/browse/SPARK-29138
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110686/testReport/]
> {code:java}
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 266, in test_parameter_accuracy
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 263, in condition
> self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.17619737864096185 != 0.1 within 1 places {code}






[jira] [Resolved] (SPARK-30695) Upgrade Apache ORC to 1.5.9

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30695.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27421
[https://github.com/apache/spark/pull/27421]

> Upgrade Apache ORC to 1.5.9
> ---
>
> Key: SPARK-30695
> URL: https://issues.apache.org/jira/browse/SPARK-30695
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to update Apache ORC dependency to the latest version 1.5.9.






[jira] [Commented] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027913#comment-17027913
 ] 

Dongjoon Hyun commented on SPARK-30658:
---

Got it. Thank you for confirmation, [~tdas].

> Limit after on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing max 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch thus 
> producing incorrect results.
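
A hedged Scala sketch of the affected query shape, using the built-in "rate" source and console sink purely for illustration (names and timeouts are arbitrary):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("limit-before-agg").getOrCreate()

// A streaming source; "rate" just produces rows continuously.
val stream = spark.readStream.format("rate").load()

// A limit placed before a streaming aggregate in complete mode: the limit should
// be global across all batches (at most 5 rows ever counted), not 5 rows per
// micro-batch, which is what the bug produced.
val query = stream.limit(5).groupBy().count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination(10000)  // run briefly; enough to see a few micro-batches
{code}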






[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error

2020-01-31 Thread Tathagata Das (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027910#comment-17027910
 ] 

Tathagata Das commented on SPARK-30657:
---

This fix by itself (separate from the fix for SPARK-30658) could be backported. 
The solution I chose, always injecting StreamingLocalLimitExec, is safe from a 
correctness point of view, but is a little risky from a performance point of 
view (which I tried to minimize with the optimization). With 2.4.4+, unless 
this is a serious bug that affects many users, I don't think we should backport 
it. And I don't think limit on streaming is used extensively enough for this to 
be a big bug (it has not been reported for 1.5 years).

What do you think, [~zsxwing]?

> Streaming limit after streaming dropDuplicates can throw error
> --
>
> Key: SPARK-30657
> URL: https://issues.apache.org/jira/browse/SPARK-30657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> {{LocalLimitExec}} does not consume the iterator of the child plan. So if 
> there is a limit after a stateful operator like streaming dedup in append 
> mode (e.g. {{streamingdf.dropDuplicates().limit(5)}}), the state changes of 
> the streaming dedup may not be committed (most stateful ops commit state 
> changes only after the generated iterator is fully consumed). This leads to 
> the next batch failing with {{java.lang.IllegalStateException: Error reading 
> delta file .../N.delta does not exist}}, as the state store delta file was 
> never generated.
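
A small, self-contained Scala illustration of the underlying hazard (not Spark code): if the consumer stops early, work that runs only when the iterator is exhausted never happens, analogous to the state commit being skipped.

{code:scala}
object EarlyStopIterator {
  // Wrap an iterator so that `commit()` runs only once the data is fully consumed.
  def withCommit[A](data: Iterator[A])(commit: () => Unit): Iterator[A] = new Iterator[A] {
    def hasNext: Boolean = {
      val more = data.hasNext
      if (!more) commit()  // "commit state" only on exhaustion
      more
    }
    def next(): A = data.next()
  }

  def main(args: Array[String]): Unit = {
    var committed = false
    val it = withCommit((1 to 10).iterator)(() => committed = true)
    it.take(5).foreach(_ => ())          // a local-limit-style consumer stops early
    println(s"committed = $committed")   // prints false: commit never ran
  }
}
{code}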






[jira] [Commented] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-31 Thread Tathagata Das (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027906#comment-17027906
 ] 

Tathagata Das commented on SPARK-30658:
---

I am a little afraid to backport this, because it is a hacky change in the 
incremental planner, which is already quite complicated to reason about.

> Limit after on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing max 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch thus 
> producing incorrect results.






[jira] [Commented] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-31 Thread Tathagata Das (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027907#comment-17027907
 ] 

Tathagata Das commented on SPARK-30658:
---

Fixed in this PR https://github.com/apache/spark/pull/27373

> Limit after on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing max 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch thus 
> producing incorrect results.






[jira] [Resolved] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error

2020-01-31 Thread Tathagata Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-30657.
---
Resolution: Fixed

> Streaming limit after streaming dropDuplicates can throw error
> --
>
> Key: SPARK-30657
> URL: https://issues.apache.org/jira/browse/SPARK-30657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> {{LocalLimitExec}} does not consume the iterator of the child plan. So if 
> there is a limit after a stateful operator like streaming dedup in append 
> mode (e.g. {{streamingdf.dropDuplicates().limit(5)}}), the state changes of 
> the streaming dedup may not be committed (most stateful ops commit state 
> changes only after the generated iterator is fully consumed). This leads to 
> the next batch failing with {{java.lang.IllegalStateException: Error reading 
> delta file .../N.delta does not exist}}, as the state store delta file was 
> never generated.






[jira] [Resolved] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-31 Thread Tathagata Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-30658.
---
Resolution: Fixed

> Limit after on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing max 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch thus 
> producing incorrect results.






[jira] [Assigned] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-31 Thread Tathagata Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-30658:
-

Assignee: Tathagata Das

> Limit after on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> Limit before a streaming aggregate (i.e. [[df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing max 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch thus 
> producing incorrect results.






[jira] [Updated] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-31 Thread Tathagata Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-30658:
--
Description: Limit before a streaming aggregate (i.e. 
{{df.limit(5).groupBy().count()}}) in complete mode was not being planned as a 
streaming limit. The planner rule planned a logical limit with a stateful 
streaming limit plan only if the query is in append mode. As a result, instead 
of allowing max 5 rows across batches, the planned streaming query was allowing 
5 rows in every batch thus producing incorrect results.  (was: Limit before a 
streaming aggregate (i.e. [[df.limit(5).groupBy().count()}}) in complete mode 
was not being planned as a streaming limit. The planner rule planned a logical 
limit with a stateful streaming limit plan only if the query is in append mode. 
As a result, instead of allowing max 5 rows across batches, the planned 
streaming query was allowing 5 rows in every batch thus producing incorrect 
results.)

> Limit after on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing max 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch thus 
> producing incorrect results.






[jira] [Updated] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-01-31 Thread Rajkumar Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajkumar Singh updated SPARK-30688:
---
Description: 
 
{code:java}
scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
|                         null|
+-+
 
scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
-+
|unix_timestamp(20202, ww)|
+-+
|                   1578182400|
+-+
 
{code}
 

 

This seems to happen only for leap years. I dug deeper into it, and it seems 
that Spark is using java.text.SimpleDateFormat and tries to parse the 
expression here:

[org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
{code:java}
formatter.parse(
 t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
SimpleDateFormat fails to parse the date and throws an Unparseable exception, 
but Spark handles it silently and returns NULL.

*Spark-3.0:* I did some tests where Spark no longer uses the legacy 
java.text.SimpleDateFormat but the Java date/time API; the date/time API seems 
to expect a valid date with a valid format:

 org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse

  was:
 
{code:java}
scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
|                         null|
+-+
 
scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
-+
|unix_timestamp(20202, ww)|
+-+
|                   1578182400|
+-+
 
{code}
 

 

This seems to happen for leap year only, I dig deeper into it and it seems that 
 Spark is using the java.text.SimpleDateFormat and try to parse the expression 
here

[org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
{code:java}
formatter.parse(
 t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
 but fail and SimpleDateFormat unable to parse the date throw Unparseable 
Exception but Spark handle it silently and returns NULL.

 


> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> -+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen only for leap years. I dug deeper into it, and it seems 
> that Spark is using java.text.SimpleDateFormat and tries to parse the 
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> SimpleDateFormat fails to parse the date and throws an Unparseable exception, 
> but Spark handles it silently and returns NULL.
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy 
> java.text.SimpleDateFormat but the Java date/time API; the date/time API seems 
> to expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse






[jira] [Created] (SPARK-30697) Handle database and namespace exceptions in catalog.isView

2020-01-31 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30697:
---

 Summary: Handle database and namespace exceptions in catalog.isView
 Key: SPARK-30697
 URL: https://issues.apache.org/jira/browse/SPARK-30697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


catalog.isView shouldn't throw a NoSuchDatabaseException when the database (or 
namespace) doesn't exist.
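
An illustrative-only sketch of the fix pattern; the helper names and exception type below are placeholders, not Spark's actual catalog API. The idea is to treat a missing database or namespace as "not a view" instead of letting the exception escape.

{code:scala}
// Placeholder standing in for NoSuchDatabaseException / NoSuchNamespaceException.
final class NoSuchNamespaceError(name: String) extends RuntimeException(name)

// `lookupView` stands in for the real catalog lookup, which may throw when the
// database or namespace does not exist.
def isView(lookupView: Seq[String] => Boolean, nameParts: Seq[String]): Boolean =
  try lookupView(nameParts)
  catch { case _: NoSuchNamespaceError => false }

// Usage sketch: a lookup against an unknown database no longer throws.
val knownViews = Set(Seq("db1", "v1"))
def lookup(parts: Seq[String]): Boolean =
  if (parts.headOption.contains("db1")) knownViews.contains(parts)
  else throw new NoSuchNamespaceError(parts.mkString("."))

assert(isView(lookup, Seq("db1", "v1")))
assert(!isView(lookup, Seq("missing_db", "v1")))
{code}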






[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027885#comment-17027885
 ] 

Dongjoon Hyun commented on SPARK-30696:
---

I got the following numbers from the example in the JIRA. I'm wondering why 
it's 280 in the JIRA description.
{code:java}
scala> sc.version
res5: String = 2.0.2

scala> diff.count
res6: Long = 144
{code}

{code}
scala> sc.version
res1: String = 2.1.3

scala> diff.count
res2: Long = 144
{code} 

{code}
scala> sc.version
res1: String = 2.2.3

scala> diff.count
res2: Long = 144
{code}

{code}
scala> sc.version
res1: String = 2.3.4

scala> diff.count
res2: Long = 144
{code}

{code}
scala> sc.version
res1: String = 2.4.4

scala> diff.count
res2: Long = 144
{code}
 

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}






[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027883#comment-17027883
 ] 

Dongjoon Hyun commented on SPARK-30696:
---

Thank you for pinging me.

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}






[jira] [Resolved] (SPARK-30508) Add DataFrameReader.executeCommand API for external datasource

2020-01-31 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-30508.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Add DataFrameReader.executeCommand API for external datasource
> --
>
> Key: SPARK-30508
> URL: https://issues.apache.org/jira/browse/SPARK-30508
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Add DataFrameReader.executeCommand API for external datasource in order to 
> make external datasources be able to execute some custom DDL/DML commands.






[jira] [Assigned] (SPARK-30508) Add DataFrameReader.executeCommand API for external datasource

2020-01-31 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-30508:
---

Assignee: wuyi

> Add DataFrameReader.executeCommand API for external datasource
> --
>
> Key: SPARK-30508
> URL: https://issues.apache.org/jira/browse/SPARK-30508
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Add DataFrameReader.executeCommand API for external datasource in order to 
> make external datasources be able to execute some custom DDL/DML commands.






[jira] [Commented] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-01-31 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027878#comment-17027878
 ] 

Terry Kim commented on SPARK-30614:
---

Currently, ALTER COLUMN requires the TYPE property for v1 tables: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L80]

For example, you have to run 
{code:sql}
ALTER TABLE t ALTER COLUMN i TYPE bigint COMMENT 'new comment'
{code}
but not
{code:sql}
ALTER TABLE $tblName ALTER COLUMN i COMMENT 'new comment'
{code}

It's OK to break this behavior, right? (we can deduce the column type since the 
table is already looked up.)

> The native ALTER COLUMN syntax should change one thing at a time
> 
>
> Key: SPARK-30614
> URL: https://issues.apache.org/jira/browse/SPARK-30614
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the 
> SQL standard.
> {code}
> ALTER TABLE table=multipartIdentifier
>   (ALTER | CHANGE) COLUMN? column=multipartIdentifier
>   (TYPE dataType)?
>   (COMMENT comment=STRING)?
>   colPosition?   
> {code}
> The SQL standard (section 11.12) only allows changing one property at a time. 
> This is also true of other recent SQL systems such as 
> Snowflake (https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) 
> and Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).
> Snowflake has an extension that allows changing multiple columns at a time, 
> e.g. ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the SQL 
> standard, I think this syntax is better.
> For now, let's be conservative and only allow changing one property at a time.
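
A short sketch of what the proposed restriction means in practice, assuming a spark-shell session (table and column names are illustrative): a change to both the type and the comment becomes two statements.

{code:scala}
// Allowed: one property changed per statement.
spark.sql("ALTER TABLE t ALTER COLUMN c TYPE bigint")
spark.sql("ALTER TABLE t ALTER COLUMN c COMMENT 'new comment'")

// Not allowed under the proposed "one property at a time" rule:
// ALTER TABLE t ALTER COLUMN c TYPE bigint COMMENT 'new comment'
{code}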






[jira] [Assigned] (SPARK-30695) Upgrade Apache ORC to 1.5.9

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30695:
-

Assignee: Dongjoon Hyun

> Upgrade Apache ORC to 1.5.9
> ---
>
> Key: SPARK-30695
> URL: https://issues.apache.org/jira/browse/SPARK-30695
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to update Apache ORC dependency to the latest version 1.5.9.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2020-01-31 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027834#comment-17027834
 ] 

Maxim Gekk commented on SPARK-30696:


The issue can be reproduced when DateTimeUtils functions are invoked directly 
w/o casting:
{code}
var ts = -50 * MICROS_PER_YEAR
val maxTs = 50 * MICROS_PER_YEAR
val step = 30 * MICROS_PER_MINUTE
val tz = "America/Los_Angeles"
var incorrectCount = 0
while (ts <= maxTs) {
  if (toUTCTime(fromUTCTime(ts, tz), tz) != ts) {
incorrectCount += 1
  }
  ts += step
}
println(incorrectCount)
{code}

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}






[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2020-01-31 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027830#comment-17027830
 ] 

Maxim Gekk commented on SPARK-30696:


[~dongjoon] FYI

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2020-01-31 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30696:
--

 Summary: Wrong result of the combination of from_utc_timestamp and 
to_utc_timestamp
 Key: SPARK-30696
 URL: https://issues.apache.org/jira/browse/SPARK-30696
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Maxim Gekk


Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
the original timestamp in the same time zone. In the range of 100 years, the 
combination of functions returns wrong results 280 times out of 1753200:
{code:java}
scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
SECS_PER_YEAR: Long = 31557600

scala> val SECS_PER_MINUTE = 60L
SECS_PER_MINUTE: Long = 60

scala>  val tz = "America/Los_Angeles"
tz: String = America/Los_Angeles

scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
SECS_PER_MINUTE)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val diff = 
df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
warning: there was one deprecation warning; re-run with -deprecation for details
diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]

scala> diff.count
res14: Long = 280

scala> df.count
res15: Long = 1753200
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30676) Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double

2020-01-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30676.
--
Fix Version/s: 3.0.0
 Assignee: Maxim Gekk
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/27399

> Eliminate warnings from deprecated constructors of java.lang.Integer and 
> java.lang.Double
> -
>
> Key: SPARK-30676
> URL: https://issues.apache.org/jira/browse/SPARK-30676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The constructors of java.lang.Integer and java.lang.Double have been 
> deprecated already, see 
> [https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html]. The 
> following warnings are printed while compiling Spark:
> {code}
> 1. RDD.scala:240: constructor Integer in class Integer is deprecated: see 
> corresponding Javadoc for more information.
> 2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is 
> deprecated: see corresponding Javadoc for more information.
> 3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: 
> see corresponding Javadoc for more information.
> 4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see 
> corresponding Javadoc for more information.
> 5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is 
> deprecated: see corresponding Javadoc for more information.
> {code}
>  The ticket aims to replace the constructors with the valueOf methods, or maybe 
> in other ways.
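For illustration, a minimal sketch of the replacement described above (plain Scala, not the actual Spark diff):
{code}
// Deprecated since Java 9: allocates a new boxed object every time and triggers
// the "constructor Integer in class Integer is deprecated" warnings listed above.
val oldInt: java.lang.Integer = new java.lang.Integer(240)
val oldDbl: java.lang.Double  = new java.lang.Double(2.0)

// Replacement: valueOf is not deprecated and may reuse cached instances.
val newInt: java.lang.Integer = java.lang.Integer.valueOf(240)
val newDbl: java.lang.Double  = java.lang.Double.valueOf(2.0)
{code}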



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27324) document configurations related to executor metrics

2020-01-31 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-27324.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

> document configurations related to executor metrics
> ---
>
> Key: SPARK-27324
> URL: https://issues.apache.org/jira/browse/SPARK-27324
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-23429 introduced executor memory metrics, and the configuration, 
> spark.eventLog.logStageExecutorMetrics.enabled, that determines if per-stage 
> per-executor metric peaks get written to the event log. (The metrics are 
> polled and sent in the heartbeat, and this is always done; the configuration 
> is only to determine if aggregated metric peaks are written to the event log.)
> SPARK-24958 added proc fs based metrics to the executor memory metrics, and 
> the configuration, spark.eventLog.logStageExecutorProcessTreeMetrics.enabled, 
> to determine if these additional (more expensive) metrics are collected when 
> metrics are polled.
> SPARK-26329 will introduce a configuration, 
> spark.executor.metrics.pollingInterval, to allow polling at more frequent 
> intervals than the executor heartbeat.
> These configurations and how they relate to each other should be documented 
> in the Configuration page.
>  
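For reference, the three settings described above can be combined when building a session. A minimal sketch, with the config keys taken from the description and the values purely illustrative:
{code}
import org.apache.spark.sql.SparkSession

// Illustrative values only: log stage-level executor metric peaks to the event log,
// also collect the more expensive /proc-based metrics, and poll every 5 seconds.
val spark = SparkSession.builder()
  .appName("executor-metrics-example")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.logStageExecutorMetrics.enabled", "true")
  .config("spark.eventLog.logStageExecutorProcessTreeMetrics.enabled", "true")
  .config("spark.executor.metrics.pollingInterval", "5s")
  .getOrCreate()
{code}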



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27324) document configurations related to executor metrics

2020-01-31 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-27324:


Assignee: Wing Yew Poon

> document configurations related to executor metrics
> ---
>
> Key: SPARK-27324
> URL: https://issues.apache.org/jira/browse/SPARK-27324
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
>
> SPARK-23429 introduced executor memory metrics, and the configuration, 
> spark.eventLog.logStageExecutorMetrics.enabled, that determines if per-stage 
> per-executor metric peaks get written to the event log. (The metrics are 
> polled and sent in the heartbeat, and this is always done; the configuration 
> is only to determine if aggregated metric peaks are written to the event log.)
> SPARK-24958 added proc fs based metrics to the executor memory metrics, and 
> the configuration, spark.eventLog.logStageExecutorProcessTreeMetrics.enabled, 
> to determine if these additional (more expensive) metrics are collected when 
> metrics are polled.
> SPARK-26329 will introduce a configuration, 
> spark.executor.metrics.pollingInterval, to allow polling at more frequent 
> intervals than the executor heartbeat.
> These configurations and how they relate to each other should be documented 
> in the Configuration page.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27324) document configurations related to executor metrics

2020-01-31 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027807#comment-17027807
 ] 

Imran Rashid commented on SPARK-27324:
--

Fixed by https://github.com/apache/spark/pull/27329

> document configurations related to executor metrics
> ---
>
> Key: SPARK-27324
> URL: https://issues.apache.org/jira/browse/SPARK-27324
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
>
> SPARK-23429 introduced executor memory metrics, and the configuration, 
> spark.eventLog.logStageExecutorMetrics.enabled, that determines if per-stage 
> per-executor metric peaks get written to the event log. (The metrics are 
> polled and sent in the heartbeat, and this is always done; the configuration 
> is only to determine if aggregated metric peaks are written to the event log.)
> SPARK-24958 added proc fs based metrics to the executor memory metrics, and 
> the configuration, spark.eventLog.logStageExecutorProcessTreeMetrics.enabled, 
> to determine if these additional (more expensive) metrics are collected when 
> metrics are polled.
> SPARK-26329 will introduce a configuration, 
> spark.executor.metrics.pollingInterval, to allow polling at more frequent 
> intervals than the executor heartbeat.
> These configurations and how they relate to each other should be documented 
> in the Configuration page.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027801#comment-17027801
 ] 

Dongjoon Hyun commented on SPARK-25355:
---

[~pedro.rossi], 2.4.x is maintained, but Apache Spark doesn't allow backporting 
new features.

If your new PR duplicates code that already exists in 3.0.0, we cannot merge it.

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.
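For background on what impersonation means at the Hadoop level (independent of the K8s work tracked here), a minimal sketch using Hadoop's UserGroupInformation API; the user name and path are purely illustrative:
{code}
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// The superuser (the logged-in/keytab user) impersonates "alice" and
// performs an HDFS operation on her behalf.
val realUser  = UserGroupInformation.getLoginUser
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)

proxyUser.doAs(new PrivilegedExceptionAction[Boolean] {
  override def run(): Boolean = {
    val fs = FileSystem.get(new Configuration())
    fs.exists(new Path("/user/alice"))   // executed as "alice", not as the superuser
  }
})
{code}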



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30684) Show the description of metrics for WholeStageCodegen in DAG viz

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30684.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27405
[https://github.com/apache/spark/pull/27405]

> Show the description of metrics for WholeStageCodegen in DAG viz
> ---
>
> Key: SPARK-30684
> URL: https://issues.apache.org/jira/browse/SPARK-30684
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the DAG viz for SQL, a WholeStageCodegen node shows some metrics like `33 
> ms (1 ms, 1 ms, 26 ms (stage 16 (attempt 0): task 172))`, but users can't 
> understand what they mean because there is no description of those 
> metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30036) REPARTITION hint does not work with order by

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30036:
--
Fix Version/s: (was: 3.0.0)

> REPARTITION hint does not work with order by
> 
>
> Key: SPARK-30036
> URL: https://issues.apache.org/jira/browse/SPARK-30036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jackey Lee
>Priority: Major
>
> Example SQL: select /*+ REPARTITION(2) */ * from test order by a
> == Physical Plan ==
> *(1) Sort [a#0 ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 2), true, [id=#11]
>      +- Exchange RoundRobinPartitioning(3)
>           +- Scan hive default.test [a#0, b#1], HiveTableRelation 
> `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, 
> b#1|#0, b#1]
> EnsureRequirements adds ShuffleExchangeExec (RangePartitioning) after Sort if 
> RoundRobinPartitioning is behind it. This causes 2 shuffles, and the number 
> of partitions in the final stage is not the number specified by 
> RoundRobinPartitioning.
> This patch will add a new rule to change RoundRobinPartitioning to 
> RangePartitioning rather than add new ShuffleExchangeExec.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30036) REPARTITION hint does not work with order by

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027779#comment-17027779
 ] 

Dongjoon Hyun commented on SPARK-30036:
---

This was reverted via 
[https://github.com/apache/spark/commit/a2de20c0e6857653de63f46052935784be87d34f]
 

> REPARTITION hint does not work with order by
> 
>
> Key: SPARK-30036
> URL: https://issues.apache.org/jira/browse/SPARK-30036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jackey Lee
>Assignee: Jackey Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> Example SQL: select /*+ REPARTITION(2) */ * from test order by a
> == Physical Plan ==
> *(1) Sort [a#0 ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 2), true, [id=#11]
>      +- Exchange RoundRobinPartitioning(3)
>           +- Scan hive default.test [a#0, b#1], HiveTableRelation 
> `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, 
> b#1|#0, b#1]
> EnsureRequirements adds ShuffleExchangeExec (RangePartitioning) after Sort if 
> RoundRobinPartitioning is behind it. This causes 2 shuffles, and the number 
> of partitions in the final stage is not the number specified by 
> RoundRobinPartitioning.
> This patch will add a new rule to change RoundRobinPartitioning to 
> RangePartitioning rather than add new ShuffleExchangeExec.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30036) REPARTITION hint does not work with order by

2020-01-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-30036:
---
  Assignee: (was: Jackey Lee)

> REPARTITION hint does not work with order by
> 
>
> Key: SPARK-30036
> URL: https://issues.apache.org/jira/browse/SPARK-30036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jackey Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> Example SQL: select /*+ REPARTITION(2) */ * from test order by a
> == Physical Plan ==
> *(1) Sort [a#0 ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 2), true, [id=#11]
>      +- Exchange RoundRobinPartitioning(3)
>           +- Scan hive default.test [a#0, b#1], HiveTableRelation 
> `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, 
> b#1|#0, b#1]
> EnsureRequirements adds ShuffleExchangeExec (RangePartitioning) after Sort if 
> RoundRobinPartitioning is behind it. This causes 2 shuffles, and the number 
> of partitions in the final stage is not the number specified by 
> RoundRobinPartitioning.
> This patch will add a new rule to change RoundRobinPartitioning to 
> RangePartitioning rather than add new ShuffleExchangeExec.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30691) Add a few main pages

2020-01-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30691.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27416
[https://github.com/apache/spark/pull/27416]

> Add a few main pages
> 
>
> Key: SPARK-30691
> URL: https://issues.apache.org/jira/browse/SPARK-30691
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30691) Add a few main pages

2020-01-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30691:


Assignee: Huaxin Gao

> Add a few main pages
> 
>
> Key: SPARK-30691
> URL: https://issues.apache.org/jira/browse/SPARK-30691
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30691) Add a few main pages

2020-01-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30691:
-
Priority: Minor  (was: Major)

> Add a few main pages
> 
>
> Key: SPARK-30691
> URL: https://issues.apache.org/jira/browse/SPARK-30691
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30695) Upgrade Apache ORC to 1.5.9

2020-01-31 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30695:
-

 Summary: Upgrade Apache ORC to 1.5.9
 Key: SPARK-30695
 URL: https://issues.apache.org/jira/browse/SPARK-30695
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue aims to update Apache ORC dependency to the latest version 1.5.9.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30658) Limit on streaming dataframe before streaming agg returns wrong results

2020-01-31 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-30658:
-
Fix Version/s: 3.0.0

> Limit on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing max 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch thus 
> producing incorrect results.
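A minimal sketch of the query shape described above, with a rate source and memory sink chosen only for illustration; with the bug, the running count can keep growing by up to 5 rows per micro-batch instead of being capped at 5 rows overall:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("limit-before-agg")
  .getOrCreate()

// limit(5) before a global aggregation on a streaming source, complete output mode.
val counts = spark.readStream
  .format("rate").option("rowsPerSecond", "10").load()
  .limit(5)
  .groupBy()
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("limited_counts")
  .start()
{code}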



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error

2020-01-31 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-30657:
-
Fix Version/s: 3.0.0

> Streaming limit after streaming dropDuplicates can throw error
> --
>
> Key: SPARK-30657
> URL: https://issues.apache.org/jira/browse/SPARK-30657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> {{LocalLimitExec}} does not consume the iterator of the child plan. So if 
> there is a limit after a stateful operator like streaming dedup in append 
> mode (e.g. {{streamingdf.dropDuplicates().limit(5)}}), the state changes of 
> the streaming dedup may not be committed (most stateful ops commit state 
> changes only after the generated iterator is fully consumed). This leads to 
> the next batch failing with {{java.lang.IllegalStateException: Error reading 
> delta file .../N.delta does not exist}} as the state store delta file was 
> never generated.
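A minimal repro sketch of the shape described above (rate source, memory sink and checkpoint path are only illustrative); the second micro-batch is where the missing delta file would surface:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("dedup-then-limit")
  .getOrCreate()

// Stateful dropDuplicates followed by a limit, in append mode; per this report the
// next batch fails because the dedup state delta file was never written.
val query = spark.readStream
  .format("rate").option("rowsPerSecond", "10").load()
  .select("value")
  .dropDuplicates()
  .limit(5)
  .writeStream
  .outputMode("append")
  .format("memory")
  .queryName("dedup_limit")
  .option("checkpointLocation", "/tmp/dedup-limit-checkpoint") // illustrative path
  .start()
{code}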



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27996) Spark UI redirect will be failed behind the https reverse proxy

2020-01-31 Thread Igor Shikin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027680#comment-17027680
 ] 

Igor Shikin commented on SPARK-27996:
-

PR to master: [https://github.com/apache/spark/pull/27311]
Help with regression testing in a non-Kubernetes environment would be much appreciated.

> Spark UI redirect will be failed behind the https reverse proxy
> ---
>
> Key: SPARK-27996
> URL: https://issues.apache.org/jira/browse/SPARK-27996
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: Saisai Shao
>Priority: Minor
>
> When the Spark live/history UI is proxied behind a reverse proxy, the redirect 
> returns the wrong scheme. For example:
> If the reverse proxy is SSL enabled, the request from the client to the reverse 
> proxy is HTTPS, whereas if Spark's UI is not SSL enabled, the request from the 
> reverse proxy to the Spark UI is plain HTTP. Spark itself treats all the 
> requests as HTTP requests, so the redirect URL starts with "http", and the 
> redirect fails on the client side. 
> Most reverse proxies add an additional 
> header, "X-Forwarded-Proto", to tell the backend server that the client request 
> is an https request, so Spark should leverage this header to return the 
> correct URL.
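A sketch of the general idea (not the actual Spark patch): when building a redirect URL behind a proxy, prefer the scheme carried in the X-Forwarded-Proto header over the scheme of the incoming connection.
{code}
import javax.servlet.http.HttpServletRequest

// Prefer the scheme supplied by the reverse proxy in X-Forwarded-Proto;
// fall back to the scheme of the incoming (possibly plain-HTTP) connection.
def redirectScheme(req: HttpServletRequest): String =
  Option(req.getHeader("X-Forwarded-Proto"))
    .map(_.trim.toLowerCase)
    .filter(p => p == "http" || p == "https")
    .getOrElse(req.getScheme)

// Example use when building a redirect target.
def redirectUrl(req: HttpServletRequest, location: String): String = {
  val host = Option(req.getHeader("Host")).getOrElse(req.getServerName)
  s"${redirectScheme(req)}://$host$location"
}
{code}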



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2020-01-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027672#comment-17027672
 ] 

Pedro Gonçalves Rossi Rodrigues commented on SPARK-25355:
-

This problem also affects version 2.4.4, and the code responsible for it changed 
between 2.4.4 and 3.0.0, so a separate patch for 2.4.x would be needed if that 
version is also going to be maintained.

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2020-01-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027665#comment-17027665
 ] 

Pedro Gonçalves Rossi Rodrigues commented on SPARK-25355:
-

[~vanzin] I made a simple test regarding this issue. I ran spark-submit 
with the proxy-user option and checked the container args that were generated, 
and they did not include the --proxy-user option. I then made a copy of the driver 
pod, added the --proxy-user option, and it worked! Basically it is just a matter of 
passing the proxy-user argument through to the driver command when the cluster 
type is Kubernetes. I am going to submit a patch for this issue.

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30510) Publicly document options under spark.sql.*

2020-01-31 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30510:
-
Description: 
SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
but it doesn't appear to be documented in [the expected 
place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none of 
the options under {{spark.sql.*}} that are intended for users are documented on 
spark.apache.org/docs.

We should add a new documentation page for these options.

  was:SPARK-20236 added a new option, 
{{spark.sql.sources.partitionOverwriteMode}}, but it doesn't appear to be 
documented in [the expected 
place|http://spark.apache.org/docs/2.4.4/configuration.html].


> Publicly document options under spark.sql.*
> ---
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none 
> of the options under {{spark.sql.*}} that are intended for users are 
> documented on spark.apache.org/docs.
> We should add a new documentation page for these options.
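For context, the option called out above is an ordinary runtime SQL conf; a minimal sketch of setting it (the value shown is one of its two documented choices, with static being the default):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("conf-example")
  .getOrCreate()

// Overwrite only the partitions present in the incoming data instead of the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
{code}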



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30510) Publicly document options under spark.sql.*

2020-01-31 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30510:
-
Summary: Publicly document options under spark.sql.*  (was: Document 
spark.sql.sources.partitionOverwriteMode)

> Publicly document options under spark.sql.*
> ---
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027622#comment-17027622
 ] 

Dongjoon Hyun commented on SPARK-30686:
---

Thanks for the investigation. Please ping on that PR, too. If it's related, we can 
add a link to that Jira issue.

> Spark 2.4.4 metrics endpoint throwing error
> ---
>
> Key: SPARK-30686
> URL: https://issues.apache.org/jira/browse/SPARK-30686
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Behroz Sikander
>Priority: Major
>
> I am using Spark-standalone in HA mode with zookeeper.
> Once the driver is up and running, whenever I try to access the metrics api 
> using the following URL
> http://master_address/proxy/app-20200130041234-0123/api/v1/applications
> I get the following exception.
> It seems that the request never even reaches the Spark code. It would be 
> helpful if somebody could help me.
> {code:java}
> HTTP ERROR 500
> Problem accessing /api/v1/applications. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException: while trying to invoke the method 
> org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, 
> javax.servlet.http.HttpServletRequest, 
> javax.servlet.http.HttpServletResponse) of a null object loaded from field 
> org.glassfish.jersey.servlet.ServletContainer.webComponent of an object 
> loaded from local variable 'this'
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:539)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:808)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30511) Spark marks intentionally killed speculative tasks as pending, which leads to holding idle executors

2020-01-31 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-30511:
-

Assignee: Zebing Lin

> Spark marks intentionally killed speculative tasks as pending, which leads to 
> holding idle executors
> -
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Assignee: Zebing Lin
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2020-01-15 at 11.13.17.png
>
>
> *TL;DR*
>  When speculative tasks fail/get killed, they are still considered pending 
> and count towards the calculation of the number of needed executors.
> h3. Symptom
> In one of our production jobs (running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because the corresponding original tasks had finished.
> An easy repro of the issue (`--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
> cluster mode):
> {code:java}
> val n = 4000
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index < 300 && index >= 150) {
> Thread.sleep(index * 1000) // Fake running tasks
> } else if (index == 300) {
> Thread.sleep(1000 * 1000) // Fake long running tasks
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}
> You will see that, when running the last task, we would be holding 38 executors (see 
> attachment), which is exactly (152 + 3) / 4 = 38.
> h3. The Bug
> Upon examining the code of _pendingSpeculativeTasks_: 
> {code:java}
> stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
>   numTasks - 
> stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
> }.sum
> {code}
> where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
> _onSpeculativeTaskSubmitted_, but never decremented.  
> _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
> completion. *This means Spark is marking ended speculative tasks as pending, 
> which leads Spark to hold more executors than it actually needs!*
> I will have a PR ready to fix this issue, along with SPARK-28403 too
>  
>  
>  
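The fix sketched in the description amounts to discounting speculative tasks when they end, not only when the stage completes. A self-contained toy illustration of that accounting (not the actual ExecutorAllocationManager code):
{code}
import scala.collection.mutable

// Toy model: speculative tasks are counted per stage attempt when submitted and
// must be discounted again when they finish or are killed, otherwise killed
// speculative tasks keep looking "pending" forever.
class SpeculativeTaskAccounting {
  private val submitted = mutable.Map.empty[Int, Int].withDefaultValue(0)
  private val ended     = mutable.Map.empty[Int, Int].withDefaultValue(0)

  def onSpeculativeTaskSubmitted(stageAttempt: Int): Unit = submitted(stageAttempt) += 1

  // Called for finished AND intentionally killed speculative tasks.
  def onSpeculativeTaskEnded(stageAttempt: Int): Unit = ended(stageAttempt) += 1

  def pendingSpeculativeTasks: Int =
    submitted.map { case (stage, n) => n - ended(stage) }.sum
}
{code}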



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30511) Spark marks intentionally killed speculative tasks as pending, which leads to holding idle executors

2020-01-31 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-30511.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Spark marks intentionally killed speculative tasks as pending, which leads to 
> holding idle executors
> -
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2020-01-15 at 11.13.17.png
>
>
> *TL;DR*
>  When speculative tasks fail/get killed, they are still considered pending 
> and count towards the calculation of the number of needed executors.
> h3. Symptom
> In one of our production jobs (running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because the corresponding original tasks had finished.
> An easy repro of the issue (`--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
> cluster mode):
> {code:java}
> val n = 4000
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index < 300 && index >= 150) {
> Thread.sleep(index * 1000) // Fake running tasks
> } else if (index == 300) {
> Thread.sleep(1000 * 1000) // Fake long running tasks
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}
> You will see that, when running the last task, we would be holding 38 executors (see 
> attachment), which is exactly (152 + 3) / 4 = 38.
> h3. The Bug
> Upon examining the code of _pendingSpeculativeTasks_: 
> {code:java}
> stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
>   numTasks - 
> stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
> }.sum
> {code}
> where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
> _onSpeculativeTaskSubmitted_, but never decremented.  
> _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
> completion. *This means Spark is marking ended speculative tasks as pending, 
> which leads Spark to hold more executors than it actually needs!*
> I will have a PR ready to fix this issue, along with SPARK-28403 too
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30638) add resources as parameter to the PluginContext

2020-01-31 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-30638.
---
Fix Version/s: 3.0.0
 Assignee: Thomas Graves
   Resolution: Fixed

> add resources as parameter to the PluginContext
> ---
>
> Key: SPARK-30638
> URL: https://issues.apache.org/jira/browse/SPARK-30638
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Add the allocated resources as parameters to the PluginContext so that any 
> plugins in the driver or executors can use this information to initialize 
> devices or otherwise use it in a useful manner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30694) If exception occurred while fetching blocks by ExternalBlockClient, fail early when External Shuffle Service is not alive

2020-01-31 Thread angerszhu (Jira)
angerszhu created SPARK-30694:
-

 Summary: If exception occurred while fetching blocks by 
ExternalBlockClient, fail early when External Shuffle Service is not alive
 Key: SPARK-30694
 URL: https://issues.apache.org/jira/browse/SPARK-30694
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30690) Expose the documentation of CalendarInterval in API documentation

2020-01-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30690.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27412
[https://github.com/apache/spark/pull/27412]

> Expose the documentation of CalendarInterval in API documentation
> --
>
> Key: SPARK-30690
> URL: https://issues.apache.org/jira/browse/SPARK-30690
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> We should also expose it in the documentation, as we marked it as an unstable API as 
> of SPARK-30547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30690) Expose the documentation of CalendarInterval in API documentation

2020-01-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30690:


Assignee: Hyukjin Kwon

> Expose the documentation of CalendarInterval in API documentation
> --
>
> Key: SPARK-30690
> URL: https://issues.apache.org/jira/browse/SPARK-30690
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We should also expose it in the documentation, as we marked it as an unstable API as 
> of SPARK-30547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30693) Document STORED AS Clause of CREATE statement in SQL Reference

2020-01-31 Thread jobit mathew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027441#comment-17027441
 ] 

jobit mathew commented on SPARK-30693:
--

I will work on this

> Document STORED AS Clause of CREATE statement in SQL Reference
> --
>
> Key: SPARK-30693
> URL: https://issues.apache.org/jira/browse/SPARK-30693
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30693) Document STORED AS Clause of CREATE statement in SQL Reference

2020-01-31 Thread jobit mathew (Jira)
jobit mathew created SPARK-30693:


 Summary: Document STORED AS Clause of CREATE statement in SQL 
Reference
 Key: SPARK-30693
 URL: https://issues.apache.org/jira/browse/SPARK-30693
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 2.4.4
Reporter: jobit mathew






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30692) Mechanism to check that all queries of spark structured-streaming are started in case of multiple sink actions.

2020-01-31 Thread Amit (Jira)
Amit  created SPARK-30692:
-

 Summary: Mechanism to check that all queries of spark 
structured-streaming are started in case of multiple sink actions.
 Key: SPARK-30692
 URL: https://issues.apache.org/jira/browse/SPARK-30692
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.2
Reporter: Amit 


Get the Spark Structured Streaming job status (start/stop) when there are multiple 
sink actions.

We are trying to get the status of a Structured Streaming job; below is the 
requirement.

We want to push data to a Kafka topic with the offset value set to latest, and we 
use Spark listeners to get the job status. However, we observed that the listener 
is invoked as soon as one of the Spark queries starts, while the complete Spark job 
isn't actually started yet because the other queries are still initializing. This 
results in data loss: we push the data to the Kafka topic and the Kafka server sets 
the offset to latest while the complete Spark job is not started, and once the 
Spark job does start it does not consume the data from Kafka because the offset on 
the Kafka server has already been set to latest.
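One possible workaround for the requirement above is to block until every expected query is active and has made progress before publishing to Kafka. A sketch using the public StreamingQuery API; the expected query count and timeout are assumptions:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

// Wait until all started queries have produced at least one progress update,
// i.e. every sink has actually begun processing, before signalling "job started".
def awaitAllQueriesStarted(spark: SparkSession,
                           expectedQueries: Int,
                           timeoutMs: Long = 60000L): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    val active: Array[StreamingQuery] = spark.streams.active
    if (active.length == expectedQueries && active.forall(_.recentProgress.nonEmpty)) {
      return true
    }
    Thread.sleep(500)
  }
  false
}
{code}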



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30687) When reading from a file with pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row

2020-01-31 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027333#comment-17027333
 ] 

pavithra ramachandran commented on SPARK-30687:
---

I would like to work on this issue.

> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column, Spark nullifies the 
> entire row
> -
>
> Key: SPARK-30687
> URL: https://issues.apache.org/jira/browse/SPARK-30687
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bao Nguyen
>Priority: Major
>
> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column, Spark nullifies the 
> entire row instead of setting the value at that cell to be null.
>  
> {code:java}
> case class TestModel(
>   num: Double, test: String, mac: String, value: Double
> )
> val schema = 
> ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]
> //here's the content of the file test.data
> //1~test~mac1~2
> //1.0~testdatarow2~mac2~non-numeric
> //2~test1~mac1~3
> val ds = spark
>   .read
>   .schema(schema)
>   .option("delimiter", "~")
>   .csv("/test-data/test.data")
> ds.show();
> //the content of data frame. second row is all null. 
> //  ++-++-+
> //  | num| test| mac|value|
> //  ++-++-+
> //  | 1.0| test|mac1|  2.0|
> //  |null| null|null| null|
> //  | 2.0|test1|mac1|  3.0|
> //  ++-++-+
> //should be
> // ++--++-+ 
> // | num| test | mac|value| 
> // ++--++-+ 
> // | 1.0| test |mac1| 2.0 | 
> // |1.0 |testdatarow2  |mac2| null| 
> // | 2.0|test1 |mac1| 3.0 | 
> // ++--++-+{code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27747) add a logical plan link in the physical plan

2020-01-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27747.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> add a logical plan link in the physical plan
> 
>
> Key: SPARK-27747
> URL: https://issues.apache.org/jira/browse/SPARK-27747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-01-31 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027330#comment-17027330
 ] 

Rakesh Raushan commented on SPARK-30688:


I will check this issue

 

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> +-+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen for leap years only. I dug deeper into it, and it seems 
> that Spark uses java.text.SimpleDateFormat and tries to parse the 
> expression here
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
>  but it fails: SimpleDateFormat is unable to parse the date and throws an Unparseable 
> exception, which Spark handles silently and returns NULL.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30691) Add a few main pages

2020-01-31 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-30691:
--

 Summary: Add a few main pages
 Key: SPARK-30691
 URL: https://issues.apache.org/jira/browse/SPARK-30691
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30615) normalize the column name in AlterTable

2020-01-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30615:
---

Assignee: Burak Yavuz

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Burak Yavuz
>Priority: Major
>
> Because of case-insensitive resolution, the column name in AlterTable may 
> match the table schema without being exactly the same. To ease DS v2 
> implementations, Spark should normalize the column names before passing them 
> to v2 catalogs, so that users don't need to care about the case sensitivity 
> config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30615) normalize the column name in AlterTable

2020-01-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30615.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27350
[https://github.com/apache/spark/pull/27350]

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> Because of case-insensitive resolution, the column name in AlterTable may 
> match the table schema without being exactly the same. To ease DS v2 
> implementations, Spark should normalize the column names before passing them 
> to v2 catalogs, so that users don't need to care about the case sensitivity 
> config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error

2020-01-31 Thread Behroz Sikander (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027291#comment-17027291
 ] 

Behroz Sikander commented on SPARK-30686:
-

Could it be linked to [https://github.com/apache/spark/pull/19748] ?

> Spark 2.4.4 metrics endpoint throwing error
> ---
>
> Key: SPARK-30686
> URL: https://issues.apache.org/jira/browse/SPARK-30686
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Behroz Sikander
>Priority: Major
>
> I am using Spark-standalone in HA mode with zookeeper.
> Once the driver is up and running, whenever I try to access the metrics api 
> using the following URL
> http://master_address/proxy/app-20200130041234-0123/api/v1/applications
> I get the following exception.
> It seems that the request never even reaches the Spark code. It would be 
> helpful if somebody could help me.
> {code:java}
> HTTP ERROR 500
> Problem accessing /api/v1/applications. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException: while trying to invoke the method 
> org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, 
> javax.servlet.http.HttpServletRequest, 
> javax.servlet.http.HttpServletResponse) of a null object loaded from field 
> org.glassfish.jersey.servlet.ServletContainer.webComponent of an object 
> loaded from local variable 'this'
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:539)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:808)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26448) retain the difference between 0.0 and -0.0

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027282#comment-17027282
 ] 

Dongjoon Hyun commented on SPARK-26448:
---

Do we need to mark this as a correctness issue or to backport this?

> retain the difference between 0.0 and -0.0
> --
>
> Key: SPARK-26448
> URL: https://issues.apache.org/jira/browse/SPARK-26448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive

2020-01-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027280#comment-17027280
 ] 

Dongjoon Hyun commented on SPARK-26021:
---

Is that backported or marked as a correctness issue?

> -0.0 and 0.0 not treated consistently, doesn't match Hive
> -
>
> Key: SPARK-26021
> URL: https://issues.apache.org/jira/browse/SPARK-26021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean R. Owen
>Assignee: Alon Doron
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new 
> issue:
> The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are 
> numerically identical but not the same double value:
> In Hive, 0.0 and -0.0 are equal since 
> https://issues.apache.org/jira/browse/HIVE-11174.
> That's not the case with Spark SQL, as "group by" (non-codegen) treats them 
> as different values. Since their hashes differ, they're put in different 
> buckets of UnsafeFixedWidthAggregationMap.
> In addition, there's an inconsistency when using codegen, as the following 
> unit test shows:
> {code:java}
> println(Seq(0.0d, 0.0d, 
> -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,3]
> {code:java}
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,1], [-0.0,2]
> {code:java}
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,2], [-0.0,1]
> Note that the only difference between the first 2 lines is the order of the 
> elements in the Seq.
> This inconsistency results from different partitioning of the Seq and the 
> use of the generated fast hash map in the first, partial aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both 
> in codegen and non-codegen modes) if we want to fix this.
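
A minimal sketch of the check suggested above, assuming we simply canonicalize 
-0.0 to +0.0 before hashing (an illustration of the idea only, not the change 
that was actually merged into Spark):

{code:scala}
// Canonicalize the IEEE 754 negative zero so that 0.0 and -0.0 hash to the
// same value and therefore land in the same aggregation bucket.
// Illustration of the idea, not Spark's actual fix.
def canonicalize(d: Double): Double =
  if (d == 0.0d) 0.0d else d  // -0.0 == 0.0 numerically, so -0.0 maps to +0.0

// Hashing the raw bits distinguishes the two zeros...
java.lang.Double.doubleToLongBits(0.0d) == java.lang.Double.doubleToLongBits(-0.0d)   // false
// ...but after canonicalization they hash identically.
java.lang.Double.doubleToLongBits(canonicalize(0.0d)) ==
  java.lang.Double.doubleToLongBits(canonicalize(-0.0d))                               // true
{code}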



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org