[jira] [Assigned] (SPARK-23432) Expose executor memory metrics in the web UI for executors
[ https://issues.apache.org/jira/browse/SPARK-23432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-23432: Assignee: Zhongwei Zhu > Expose executor memory metrics in the web UI for executors > -- > > Key: SPARK-23432 > URL: https://issues.apache.org/jira/browse/SPARK-23432 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Assignee: Zhongwei Zhu >Priority: Major > Fix For: 3.1.0 > > > Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, > and unifiedMemory, etc.) to the executors tab, in the summary and for each > executor. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
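The metrics named in this ticket are per-executor values that also need a summary row. As a rough sketch of one plausible aggregation (Python here for illustration; the metric names come from the ticket, but the `summarize` helper is hypothetical and not Spark's actual UI/REST code, which is Scala):

```python
# Illustrative only: fold per-executor memory metrics into a totals row
# for the Executors tab. `summarize` is a hypothetical helper; the metric
# keys are the ones listed in the ticket.
METRIC_KEYS = ("jvmUsedMemory", "executionMemory", "storageMemory", "unifiedMemory")

def summarize(executors):
    """executors: list of dicts mapping metric name -> value in bytes."""
    return {key: sum(e.get(key, 0) for e in executors) for key in METRIC_KEYS}
```

Whether the summary row sums, averages, or shows peaks per metric is a UI design choice; summing is shown here only as the simplest variant.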
[jira] [Resolved] (SPARK-23432) Expose executor memory metrics in the web UI for executors
[ https://issues.apache.org/jira/browse/SPARK-23432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-23432. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30186 [https://github.com/apache/spark/pull/30186] > Expose executor memory metrics in the web UI for executors > -- > > Key: SPARK-23432 > URL: https://issues.apache.org/jira/browse/SPARK-23432 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Priority: Major > Fix For: 3.1.0 > > > Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, > and unifiedMemory, etc.) to the executors tab, in the summary and for each > executor. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33333) Upgrade Jetty to 9.4.28.v20200408
[ https://issues.apache.org/jira/browse/SPARK-33333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227217#comment-17227217 ]

Apache Spark commented on SPARK-33333:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30276

> Upgrade Jetty to 9.4.28.v20200408
> ---------------------------------
>
> Key: SPARK-33333
> URL: https://issues.apache.org/jira/browse/SPARK-33333
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.4.7, 3.0.1
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 3.0.2
[jira] [Commented] (SPARK-33302) Failed to push down filters through Expand
[ https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227212#comment-17227212 ]

angerszhu commented on SPARK-33302:
-----------------------------------

Working on this.

> Failed to push down filters through Expand
> ------------------------------------------
>
> Key: SPARK-33302
> URL: https://issues.apache.org/jira/browse/SPARK-33302
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.4, 3.0.1, 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) using parquet;
> create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet;
>
> SELECT years,
>        appversion,
>        SUM(uusers) AS users
> FROM   (SELECT Date_trunc('year', dt)    AS years,
>                CASE
>                  WHEN h.pid = 3 THEN 'iOS'
>                  WHEN h.pid = 4 THEN 'Android'
>                  ELSE 'Other'
>                END                       AS viewport,
>                h.vs                      AS appversion,
>                Count(DISTINCT u.uid)     AS uusers,
>                Count(DISTINCT u.suid)    AS srcusers
>         FROM   SPARK_33302_1 u
>                join SPARK_33302_2 h
>                  ON h.uid = u.uid
>         GROUP  BY 1, 2, 3) AS a
> WHERE  viewport = 'iOS'
> GROUP  BY 1, 2
> {code}
> {noformat}
> == Physical Plan ==
> *(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
> +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
>   +- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)])
>     +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)])
>       +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
>         +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)])
>           +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
>             +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241]
>               +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
>                 +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
>                   +- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
>                     +- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12]
>                       +- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight
>                         :- *(2) Project [uid#7, dt#9, suid#10]
>                         :  +- *(2) Filter isnotnull(ui
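The plan above shows the `Filter` on `CASE WHEN ... END#46` sitting directly above the `Expand` instead of being pushed below it. A minimal sketch of the safety condition such a pushdown needs (pure Python with hypothetical names; Spark's actual rule operates on Catalyst expressions): a predicate may move below an `Expand` only when every attribute it references is produced identically by all of the Expand's projections.

```python
# Hypothetical, simplified model of the optimization SPARK-33302 asks for.
# A Filter above an Expand can be pushed below it only when every column
# the predicate references maps to one and the same source expression in
# all of the Expand's projections (None models the NULL literal that a
# grouping-set projection injects).

def pushable_through_expand(predicate_refs, projections):
    """predicate_refs: set of output attribute names the filter uses.
    projections: list of dicts mapping output attr -> source expression."""
    for attr in predicate_refs:
        exprs = {proj.get(attr) for proj in projections}
        # Differing (or NULL-injected) expressions mean filtering below
        # the Expand would change the result, so the pushdown is unsafe.
        if len(exprs) != 1 or None in exprs:
            return False
    return True
```

In the plan above, the `viewport` CASE expression is the same in both `ArrayBuffer` projections, so the `viewport = 'iOS'` filter would qualify, while a filter on `uid` (NULL in one projection) would not.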
[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227201#comment-17227201 ] Apache Spark commented on SPARK-32691: -- User 'huangtianhua' has created a pull request for this issue: https://github.com/apache/spark/pull/30275 > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: zhengruifeng >Priority: Major > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32691: Assignee: Apache Spark (was: zhengruifeng) > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: Apache Spark >Priority: Major > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32691: Assignee: zhengruifeng (was: Apache Spark) > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: zhengruifeng >Priority: Major > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227196#comment-17227196 ]

huangtianhua commented on SPARK-32691:
--------------------------------------

We have found the problem: remote replication takes longer than the default 120-second timeout, so the block is retried on another executor even though the first replication does in fact complete, leaving 3 replicas in total. We then found the process hangs in CryptoRandomFactory.getCryptoRandom(properties), because commons-crypto v1.0.0 does not support aarch64; after switching to v1.1.0 the tests pass and run quickly. So I plan to propose a PR to upgrade to commons-crypto v1.1.0, which supports aarch64:
http://commons.apache.org/proper/commons-crypto/changes-report.html
https://issues.apache.org/jira/browse/CRYPTO-139

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --------------------------------------------------------------
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Tests
> Affects Versions: 3.1.0
> Environment: ARM64
> Reporter: huangtianhua
> Assignee: zhengruifeng
> Priority: Major
> Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, success.log
>
> Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/
> - caching in memory and disk, replicated (encryption = on) (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
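The failure mode described in the comment is a race: the replication RPC exceeds its timeout, the caller retries on another executor, and the slow copy completes anyway, so the block ends up with 3 replicas instead of 2. A toy simulation of that race (all names hypothetical; illustrative only, with the default mirroring the 120-second timeout mentioned above):

```python
# Illustrative-only model of the SPARK-32691 symptom: a replication that
# exceeds the RPC timeout is retried elsewhere even though the slow copy
# eventually lands, producing one replica too many.

def replicate(target_replicas, attempt_secs, timeout_secs=120):
    """attempt_secs: how long each successive replication attempt takes."""
    replicas = 1                      # the original block
    for secs in attempt_secs:
        if replicas >= target_replicas:
            break
        replicas += 1                 # the copy completes either way
        if secs > timeout_secs:
            # The caller saw a timeout and retried on another executor,
            # not knowing the slow copy also succeeded.
            replicas += 1
    return replicas
```

With commons-crypto 1.0.0 falling back to slow (non-native) random generation on aarch64, the first attempt blows past the timeout and the model yields 3 replicas; with fast attempts it yields the expected 2.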
[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227190#comment-17227190 ]

angerszhu commented on SPARK-33326:
-----------------------------------

Could you share which Hive version you are using? This does not seem to reproduce on the current branch.

> Partition Parameters are not updated even after ANALYZE TABLE command
> ---------------------------------------------------------------------
>
> Key: SPARK-33326
> URL: https://issues.apache.org/jira/browse/SPARK-33326
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Daniel Bondor
> Priority: Major
>
> Here are the reproduction steps:
> {code:java}
> scala> spark.sql("CREATE TABLE t (a string, b string) PARTITIONED BY (p string) STORED AS PARQUET")
> Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false)
> ...
> |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, transient_lastDdlTime=1604404640, totalSize=532, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, numRows=0}|
> ...
> |Partition Statistics |1064 bytes, 2 rows |
> ...
> {code}
> My expectation would be that the Partition Parameters should be updated after ANALYZE TABLE.
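For reference, the reporter's expectation is that ANALYZE TABLE writes the computed statistics back into the partition parameters (note `numRows=0` in the parameters while `Partition Statistics` shows 2 rows). A toy model of that bookkeeping (a hypothetical helper, not Spark's catalog code; the parameter keys mirror the Hive-style keys shown in the output above):

```python
# Illustrative only: what "updating Partition Parameters after ANALYZE
# TABLE" means as a data operation. `analyze_partition` is hypothetical.

def analyze_partition(params, computed):
    """params: current Hive-style partition parameters (string values).
    computed: freshly computed statistics. Returns updated parameters
    without mutating the original dict."""
    updated = dict(params)
    updated.update({
        "numRows": computed["numRows"],
        "rawDataSize": computed["rawDataSize"],
    })
    return updated
```

Under this model, after the two inserts in the reproduction the parameters would carry `numRows=2` rather than the stale `numRows=0`.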
[jira] [Resolved] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33364. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30267 [https://github.com/apache/spark/pull/30267] > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.1.0 > > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33364: - Assignee: Terry Kim > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33130) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
[ https://issues.apache.org/jira/browse/SPARK-33130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33130. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30038 [https://github.com/apache/spark/pull/30038] > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MsSqlServer dialect) > --- > > Key: SPARK-33130 > URL: https://issues.apache.org/jira/browse/SPARK-33130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Override the default SQL strings for: > ALTER TABLE RENAME COLUMN > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MsSQLServer JDBC dialect according to official documentation. > Write MsSqlServer integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
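The dialect mechanism this ticket extends boils down to overriding the SQL strings a generic JDBC dialect emits. A hedged sketch in Python (Spark's real dialects are Scala objects under `org.apache.spark.sql.jdbc`, and the method names below are illustrative; the emitted T-SQL forms, `sp_rename` and the type-restating `ALTER COLUMN`, are genuine SQL Server syntax):

```python
# Sketch only: a base dialect emits generic ALTER TABLE strings and the
# SQL Server dialect overrides them. Method names are hypothetical.

class JdbcDialect:
    def rename_column_sql(self, table, old, new):
        return f"ALTER TABLE {table} RENAME COLUMN {old} TO {new}"

    def update_nullability_sql(self, table, col, data_type, nullable):
        action = "SET NULL" if nullable else "SET NOT NULL"
        return f"ALTER TABLE {table} ALTER COLUMN {col} {action}"

class MsSqlServerDialect(JdbcDialect):
    def rename_column_sql(self, table, old, new):
        # SQL Server has no RENAME COLUMN clause; it uses sp_rename.
        return f"EXEC sp_rename '{table}.{old}', '{new}', 'COLUMN'"

    def update_nullability_sql(self, table, col, data_type, nullable):
        # SQL Server requires restating the column type when changing
        # nullability.
        null_kw = "NULL" if nullable else "NOT NULL"
        return f"ALTER TABLE {table} ALTER COLUMN {col} {data_type} {null_kw}"
```

The integration tests the ticket asks for would then assert that each override produces SQL the target server actually accepts.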
[jira] [Assigned] (SPARK-33130) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
[ https://issues.apache.org/jira/browse/SPARK-33130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33130: --- Assignee: Prashant Sharma > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MsSqlServer dialect) > --- > > Key: SPARK-33130 > URL: https://issues.apache.org/jira/browse/SPARK-33130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > Override the default SQL strings for: > ALTER TABLE RENAME COLUMN > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MsSQLServer JDBC dialect according to official documentation. > Write MsSqlServer integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33342) In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
[ https://issues.apache.org/jira/browse/SPARK-33342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-33342.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 30249
[https://github.com/apache/spark/pull/30249]

> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33342
> URL: https://issues.apache.org/jira/browse/SPARK-33342
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 3.0.1
> Environment: spark version: spark3.0.1
> Reporter: akiyamaneko
> Assignee: akiyamaneko
> Priority: Minor
> Fix For: 3.1.0
> Attachments: cannot-jump.gif, display error.png
>
> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail. For example, the thread is *Thread 73* but it is shown as *Thread some(73)*.
[jira] [Assigned] (SPARK-33342) In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
[ https://issues.apache.org/jira/browse/SPARK-33342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang reassigned SPARK-33342:
--------------------------------------
    Assignee: akiyamaneko

> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33342
> URL: https://issues.apache.org/jira/browse/SPARK-33342
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 3.0.1
> Environment: spark version: spark3.0.1
> Reporter: akiyamaneko
> Assignee: akiyamaneko
> Priority: Minor
> Fix For: 3.1.0
> Attachments: cannot-jump.gif, display error.png
>
> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail. For example, the thread is *Thread 73* but it is shown as *Thread some(73)*.
[jira] [Commented] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227177#comment-17227177 ] Apache Spark commented on SPARK-32860: -- User 'hannahkamundson' has created a pull request for this issue: https://github.com/apache/spark/pull/30274 > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227174#comment-17227174 ] Apache Spark commented on SPARK-33369: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/30273 > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. However, there > is no such trait or flag for indicating a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > When TableProvider.supportsExternalMetadata() is true, Spark will use the > input Dataframe's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33369: Assignee: Apache Spark (was: Gengliang Wang) > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. However, there > is no such trait or flag for indicating a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > When TableProvider.supportsExternalMetadata() is true, Spark will use the > input Dataframe's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33369: Assignee: Gengliang Wang (was: Apache Spark) > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. However, there > is no such trait or flag for indicating a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > When TableProvider.supportsExternalMetadata() is true, Spark will use the > input Dataframe's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-33369: --- Description: For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on DataframeWriter.save(). The inference of table schema/partitioning can be expensive. However, there is no such trait or flag for indicating a V2 source can use the input DataFrame's schema on DataframeWriter.save(). We can resolve the problem by adding a new expected behavior for the method TableProvider.supportsExternalMetadata(): When TableProvider.supportsExternalMetadata() is true, Spark will use the input Dataframe's schema in DataframeWriter.save() and skip schema/partitioning inference. was: For all the v2 data sources which are not FileDataSourceV2, Spark always infer the table schema/partitioning on DataframeWriter.save(). Currently, there is no such trait or flag for indicating a V2 source can use the input DataFrame's schema on DataframeWriter.save(). We can resolve the problem by adding a new expected behavior for the method TableProvider.supportsExternalMetadata(): when TableProvider.supportsExternalMetadata() is true, Spark will use the input Dataframe's schema in DataframeWriter.save() and skip schema/partitioning inference. > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. 
However, there > is no trait or flag for indicating that a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > when TableProvider.supportsExternalMetadata() is true, Spark will use the > input DataFrame's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
Gengliang Wang created SPARK-33369: -- Summary: Skip schema inference in DataframeWriter.save() if table provider supports external metadata Key: SPARK-33369 URL: https://issues.apache.org/jira/browse/SPARK-33369 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Gengliang Wang Assignee: Gengliang Wang For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on DataframeWriter.save(). Currently, there is no trait or flag for indicating that a V2 source can use the input DataFrame's schema on DataframeWriter.save(). We can resolve the problem by adding a new expected behavior for the method TableProvider.supportsExternalMetadata(): when TableProvider.supportsExternalMetadata() is true, Spark will use the input DataFrame's schema in DataframeWriter.save() and skip schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
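The proposed behavior can be sketched against the existing DataSource V2 API. A minimal, hypothetical provider (the class name and the thrown message are invented for illustration; only TableProvider, supportsExternalMetadata(), inferSchema(), and getTable() come from the Spark connector API):

```scala
import java.util.{Map => JMap}

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class MySourceProvider extends TableProvider {

  // Declaring support for external metadata is what would let Spark pass the
  // input DataFrame's schema straight through on DataframeWriter.save(),
  // skipping the potentially expensive inference step described above.
  override def supportsExternalMetadata(): Boolean = true

  // Only reached when no external schema is supplied.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    throw new UnsupportedOperationException("schema must be provided externally")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: JMap[String, String]): Table = {
    // Build the Table from the externally supplied schema/partitioning.
    ???
  }
}
```

With such a provider, `df.write.format(...).save()` could use `df.schema` directly instead of calling `inferSchema`.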
[jira] [Created] (SPARK-33368) SimplifyConditionals simplifies non-deterministic expressions
Yuming Wang created SPARK-33368: --- Summary: SimplifyConditionals simplifies non-deterministic expressions Key: SPARK-33368 URL: https://issues.apache.org/jira/browse/SPARK-33368 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 2.4.7, 3.1.0 Reporter: Yuming Wang It seems we simplify non-deterministic expressions once they are aliased. For example: {code:sql} CREATE TABLE t(a int, b int, c int) using parquet {code} {code:sql} SELECT CASE WHEN rand(100) > 1 THEN 1 WHEN rand(100) + 1 > 1000 THEN 1 WHEN rand(100) + 2 < 100 THEN 1 ELSE 1 END AS x FROM t {code} The plan is: {noformat} == Physical Plan == *(1) Project [CASE WHEN (rand(100) > 1.0) THEN 1 WHEN ((rand(100) + 1.0) > 1000.0) THEN 1 WHEN ((rand(100) + 2.0) < 100.0) THEN 1 ELSE 1 END AS x#6] +- *(1) ColumnarToRow +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> {noformat} {code:sql} SELECT CASE WHEN rd > 1 THEN 1 WHEN rd + 1 > 1000 THEN 1 WHEN rd + 2 < 100 THEN 1 ELSE 1 END AS x FROM (SELECT *, rand(100) as rd FROM t) t1 {code} The plan is: {noformat} == Physical Plan == *(1) Project [1 AS x#1] +- *(1) ColumnarToRow +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
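One way to read the report above: once rand(100) hides behind the alias rd, the optimizer folds the whole CASE into the constant 1 even though each branch condition is non-deterministic. A rough sketch of the kind of determinism guard the rule needs (simplified and hypothetical, not the actual SimplifyConditionals implementation):

```scala
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression}

object SimplifyConditionalsSketch {
  def simplify(e: Expression): Expression = e match {
    // Collapsing CASE WHEN branches is only safe when every branch
    // predicate is deterministic: rand(100) > 1 may evaluate differently
    // per row, so branches must not be merged even if all values agree.
    case cw @ CaseWhen(branches, _) if branches.forall(_._1.deterministic) =>
      // ... perform the usual branch elimination here ...
      cw
    case other => other // leave non-deterministic conditionals untouched
  }
}
```

The second query plan in the report (Project [1 AS x#1]) shows exactly the folding that such a guard would prevent.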
[jira] [Assigned] (SPARK-33367) Move the table commands from ddl.scala to table.scala
[ https://issues.apache.org/jira/browse/SPARK-33367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33367: Assignee: Apache Spark > Move the table commands from ddl.scala to table.scala > - > > Key: SPARK-33367 > URL: https://issues.apache.org/jira/browse/SPARK-33367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > The table commands can be moved from `ddl.scala` to `table.scala` per the > following: > https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 > This change will result in table-related commands in the same file > `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33367) Move the table commands from ddl.scala to table.scala
[ https://issues.apache.org/jira/browse/SPARK-33367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227169#comment-17227169 ] Apache Spark commented on SPARK-33367: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30271 > Move the table commands from ddl.scala to table.scala > - > > Key: SPARK-33367 > URL: https://issues.apache.org/jira/browse/SPARK-33367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > The table commands can be moved from `ddl.scala` to `table.scala` per the > following: > https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 > This change will result in table-related commands in the same file > `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33367) Move the table commands from ddl.scala to table.scala
[ https://issues.apache.org/jira/browse/SPARK-33367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33367: Assignee: (was: Apache Spark) > Move the table commands from ddl.scala to table.scala > - > > Key: SPARK-33367 > URL: https://issues.apache.org/jira/browse/SPARK-33367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > The table commands can be moved from `ddl.scala` to `table.scala` per the > following: > https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 > This change will result in table-related commands in the same file > `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33367) Move the table commands from ddl.scala to table.scala
Terry Kim created SPARK-33367: - Summary: Move the table commands from ddl.scala to table.scala Key: SPARK-33367 URL: https://issues.apache.org/jira/browse/SPARK-33367 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim The table commands can be moved from `ddl.scala` to `table.scala` per the following: https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 This change will result in table-related commands in the same file `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32860: Assignee: (was: Apache Spark) > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227155#comment-17227155 ] Apache Spark commented on SPARK-32860: -- User 'hannahkamundson' has created a pull request for this issue: https://github.com/apache/spark/pull/30272 > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32860: Assignee: Apache Spark > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Assignee: Apache Spark >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
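The ticket's claim that map support already works can be checked with a small bean. The Settings class below is invented for the example; only Encoders.bean and SparkSession are the real APIs:

```scala
import java.util.{Map => JMap}

import org.apache.spark.sql.{Encoders, SparkSession}

// A Java-style bean with a map-typed property, the case the doc says
// is "in progress".
class Settings extends Serializable {
  private var props: JMap[String, String] = _
  def getProps: JMap[String, String] = props
  def setProps(p: JMap[String, String]): Unit = { props = p }
}

object BeanMapExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    // Per SPARK-16706 this should succeed on 2.1.0+: the map property is
    // encoded as a map-typed column rather than being rejected.
    val enc = Encoders.bean(classOf[Settings])
    println(enc.schema) // expect a map<string,string> field named props
    spark.stop()
  }
}
```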
[jira] [Commented] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227127#comment-17227127 ] Apache Spark commented on SPARK-33366: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30270 > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33366: Assignee: (was: Apache Spark) > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227126#comment-17227126 ] Apache Spark commented on SPARK-33366: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30270 > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33366: Assignee: Apache Spark > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227112#comment-17227112 ] angerszhu commented on SPARK-33326: --- working on this. > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33366) Migrate LOAD DATA to new resolution framework
Terry Kim created SPARK-33366: - Summary: Migrate LOAD DATA to new resolution framework Key: SPARK-33366 URL: https://issues.apache.org/jira/browse/SPARK-33366 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-33326: -- Comment: was deleted (was: Working. on this. [~dabondor] Your hive version is 2.3? Since seems Hive-1.2.1 don't have this partition parameter) > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227107#comment-17227107 ] angerszhu edited comment on SPARK-33326 at 11/6/20, 2:05 AM: - Working. on this. [~dabondor] Your hive version is 2.3? Since seems Hive-1.2.1 don't have this partition parameter was (Author: angerszhuuu): Working. on this > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227107#comment-17227107 ] angerszhu commented on SPARK-33326: --- Working on this. > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33365: - Assignee: William Hyun > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33365. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30268 [https://github.com/apache/spark/pull/30268] > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33360. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30264 [https://github.com/apache/spark/pull/30264] > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33361) Dataset.observe() functionality does not work with structured streaming
[ https://issues.apache.org/jira/browse/SPARK-33361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227078#comment-17227078 ] Jungtaek Lim commented on SPARK-33361: -- cc. [~hvanhovell] > Dataset.observe() functionality does not work with structured streaming > --- > > Key: SPARK-33361 > URL: https://issues.apache.org/jira/browse/SPARK-33361 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on k8s, version 3.0.0 >Reporter: John Wesley >Priority: Major > > The dataset observe() functionality does not work as expected with spark in > cluster mode. > Related discussion here: > [http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-emit-custom-metrics-to-Prometheus-in-spark-structured-streaming-td38826.html] > Using lit() as the aggregation column goes through well. However sum, count > etc returns 0 all the time. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
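For context, the streaming usage of Dataset.observe that the ticket describes looks roughly like the sketch below. The metric name "my_metrics", the column names, and the rate source are invented for illustration; observe(), StreamingQueryListener, and observedMetrics are the real Spark 3.0 APIs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object ObserveExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").getOrCreate()

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(e: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(e: QueryProgressEvent): Unit = {
        // The reported symptom: in cluster mode these aggregates came back
        // as 0, while a lit() column passed through correctly.
        val metrics = e.progress.observedMetrics.get("my_metrics")
        if (metrics != null) println(s"rows=${metrics.getAs[Long]("rows")}")
      }
    })

    val stream = spark.readStream.format("rate").load()
      .observe("my_metrics", count(lit(1)).as("rows"), sum("value").as("total"))

    stream.writeStream.format("console").start().awaitTermination()
  }
}
```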
[jira] [Commented] (SPARK-33359) foreachBatch sink outputs wrong metrics
[ https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227064#comment-17227064 ] John Wesley commented on SPARK-33359: - Thanks for the clarification. How about numInputRows? > foreachBatch sink outputs wrong metrics > --- > > Key: SPARK-33359 > URL: https://issues.apache.org/jira/browse/SPARK-33359 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The > CRD is ScheduledSparkApplication >Reporter: John Wesley >Priority: Minor > > I created 2 similar jobs, > 1) First job reading from kafka and writing to console sink in append mode > 2) Second job reading from kafka and writing to foreachBatch sink (which then > writes in parquet format to S3). > The metrics in the log for console shows correct values for numInputRows and > numOutputRows whereas they are wrong for foreachBatch. > With foreachBatch: > numInputRows is +1 more than what is actually present > numOutputRows is always -1. 
> ///Console sink > //20/11/05 13:36:21 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", > "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", > "name" : null, > "timestamp" : "2020-11-05T13:36:08.921Z", > "batchId" : 0, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658, > "durationMs" : { > "addBatch" : 7735, > "getBatch" : 152, > "latestOffset" : 2037, > "queryPlanning" : 1010, > "triggerExecution" : 12886, > "walCommit" : 938 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", > "numOutputRows" : 10 > } > } > ///ForEachBatch Sink > //20/11/05 13:43:38 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", > "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", > "name" : null, > "timestamp" : "2020-11-05T13:43:15.421Z", > "batchId" : 0, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348, > "durationMs" : { > "addBatch" : 17689, > "getBatch" : 135, > "latestOffset" : 2121, > "queryPlanning" : 880, > "triggerExecution" : 22758, > "walCommit" : 876 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348 > } ], > "sink" : { > "description" : "ForeachBatchSink", > "numOutputRows" : -1 > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33359) foreachBatch sink outputs wrong metrics
[ https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-33359. -- Resolution: Not A Problem > foreachBatch sink outputs wrong metrics > --- > > Key: SPARK-33359 > URL: https://issues.apache.org/jira/browse/SPARK-33359 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The > CRD is ScheduledSparkApplication >Reporter: John Wesley >Priority: Minor > > I created 2 similar jobs, > 1) First job reading from kafka and writing to console sink in append mode > 2) Second job reading from kafka and writing to foreachBatch sink (which then > writes in parquet format to S3). > The metrics in the log for console shows correct values for numInputRows and > numOutputRows whereas they are wrong for foreachBatch. > With foreachBatch: > numInputRows is +1 more than what is actually present > numOutputRows is always -1. 
> ///Console sink > //20/11/05 13:36:21 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", > "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", > "name" : null, > "timestamp" : "2020-11-05T13:36:08.921Z", > "batchId" : 0, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658, > "durationMs" : { > "addBatch" : 7735, > "getBatch" : 152, > "latestOffset" : 2037, > "queryPlanning" : 1010, > "triggerExecution" : 12886, > "walCommit" : 938 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", > "numOutputRows" : 10 > } > } > ///ForEachBatch Sink > //20/11/05 13:43:38 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", > "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", > "name" : null, > "timestamp" : "2020-11-05T13:43:15.421Z", > "batchId" : 0, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348, > "durationMs" : { > "addBatch" : 17689, > "getBatch" : 135, > "latestOffset" : 2121, > "queryPlanning" : 880, > "triggerExecution" : 22758, > "walCommit" : 876 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348 > } ], > "sink" : { > "description" : "ForeachBatchSink", > "numOutputRows" : -1 > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33359) foreachBatch sink outputs wrong metrics
[ https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227029#comment-17227029 ] Jungtaek Lim commented on SPARK-33359: -- That's by design. The metric is available only for V2 sink. The interface of V1 sink doesn't open the possibility to count the number of output rows, as it can run arbitrary DataFrame operations not just writing, just like we do with foreachBatch. -1 means "unknown". Probably better to document this to avoid confusion, but this by itself is not a bug. Let me close this for now. Thanks. > foreachBatch sink outputs wrong metrics > --- > > Key: SPARK-33359 > URL: https://issues.apache.org/jira/browse/SPARK-33359 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The > CRD is ScheduledSparkApplication >Reporter: John Wesley >Priority: Minor > > I created 2 similar jobs, > 1) First job reading from kafka and writing to console sink in append mode > 2) Second job reading from kafka and writing to foreachBatch sink (which then > writes in parquet format to S3). > The metrics in the log for console shows correct values for numInputRows and > numOutputRows whereas they are wrong for foreachBatch. > With foreachBatch: > numInputRows is +1 more than what is actually present > numOutputRows is always -1. 
> ///Console sink > //20/11/05 13:36:21 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", > "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", > "name" : null, > "timestamp" : "2020-11-05T13:36:08.921Z", > "batchId" : 0, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658, > "durationMs" : { > "addBatch" : 7735, > "getBatch" : 152, > "latestOffset" : 2037, > "queryPlanning" : 1010, > "triggerExecution" : 12886, > "walCommit" : 938 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", > "numOutputRows" : 10 > } > } > ///ForEachBatch Sink > //20/11/05 13:43:38 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", > "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", > "name" : null, > "timestamp" : "2020-11-05T13:43:15.421Z", > "batchId" : 0, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348, > "durationMs" : { > "addBatch" : 17689, > "getBatch" : 135, > "latestOffset" : 2121, > "queryPlanning" : 880, > "triggerExecution" : 22758, > "walCommit" : 876 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348 > } ], > "sink" : { > "description" : "ForeachBatchSink", > "numOutputRows" : -1 > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
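[Editorial note] As the comment above explains, `numOutputRows` is only populated for V2 sinks; a V1-style sink such as `ForeachBatchSink` reports -1, meaning "unknown". If you post-process the `StreamingQueryProgress` JSON quoted in the logs above, it is worth normalizing that sentinel. The following is a small stand-alone Python sketch (the helper name is ours, not Spark's); only the JSON field names come from the logs:

```python
import json

# Interpret sink metrics from a StreamingQueryProgress JSON like the ones
# quoted above. Per the comment on SPARK-33359, numOutputRows is populated
# only for V2 sinks; ForeachBatchSink reports -1, which means "unknown".

def sink_output_rows(progress_json: str):
    """Return the sink's numOutputRows, or None when the sink reports a
    negative value (i.e. the metric is unknown, as with ForeachBatchSink)."""
    progress = json.loads(progress_json)
    rows = progress["sink"]["numOutputRows"]
    return None if rows < 0 else rows

# Minimal sink fragments, shaped like the progress logs above.
console = '{"sink": {"description": "ConsoleTable", "numOutputRows": 10}}'
foreach = '{"sink": {"description": "ForeachBatchSink", "numOutputRows": -1}}'
print(sink_output_rows(console))  # 10
print(sink_output_rows(foreach))  # None
```

Treating -1 as `None` rather than as a real count avoids the confusion that prompted this ticket.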
[jira] [Commented] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227027#comment-17227027 ] Apache Spark commented on SPARK-33365: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30268 > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227026#comment-17227026 ] Apache Spark commented on SPARK-33365: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30268 > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33365: Assignee: Apache Spark > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33365: Assignee: (was: Apache Spark) > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33365) Update SBT to 1.4.2
William Hyun created SPARK-33365: Summary: Update SBT to 1.4.2 Key: SPARK-33365 URL: https://issues.apache.org/jira/browse/SPARK-33365 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
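[Editorial note] For context on what this bump involves: sbt launches the version pinned in `project/build.properties`, so the change proposed here amounts to a one-line edit of the form:

```properties
# project/build.properties
sbt.version=1.4.2
```

Since sbt 1.3.0 the launched build also resolves dependencies through Coursier, which is what SPARK-33353 (below) adjusts the CI caches for.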
[jira] [Commented] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226950#comment-17226950 ] Apache Spark commented on SPARK-33364: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30267 > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33364: Assignee: Apache Spark > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33364: Assignee: (was: Apache Spark) > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33364) Expose purge option in TableCatalog.dropTable
Terry Kim created SPARK-33364: - Summary: Expose purge option in TableCatalog.dropTable Key: SPARK-33364 URL: https://issues.apache.org/jira/browse/SPARK-33364 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
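[Editorial note] To make the proposal concrete, here is an illustrative Python sketch (Spark's actual `TableCatalog` is a Java interface, and the real signature added by the PR may differ) of how a purge flag can be exposed with a backward-compatible default that rejects purging, so existing catalog implementations keep compiling while purge-capable ones override it:

```python
# Illustrative only: mirrors the idea of adding a purge option to a
# catalog's dropTable with a safe default, not Spark's actual API.

class TableCatalog:
    def drop_table(self, ident: str, purge: bool = False) -> bool:
        if purge:
            # Catalogs that cannot bypass trash/retention refuse by default.
            raise NotImplementedError(f"purge is not supported for {ident}")
        return self._do_drop(ident)

    def _do_drop(self, ident: str) -> bool:
        raise NotImplementedError


class InMemoryCatalog(TableCatalog):
    """Toy catalog: drops succeed, but purge is left unsupported."""

    def __init__(self):
        self.tables = {"db.t": object()}

    def _do_drop(self, ident: str) -> bool:
        return self.tables.pop(ident, None) is not None
```

A default that raises keeps the change source-compatible: callers can request a purge, and catalogs opt in explicitly.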
[jira] [Assigned] (SPARK-33185) YARN: Print direct links to driver logs alongside application report in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-33185: --- Assignee: Erik Krogen > YARN: Print direct links to driver logs alongside application report in > cluster mode > > > Key: SPARK-33185 > URL: https://issues.apache.org/jira/browse/SPARK-33185 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > Currently when run in {{cluster}} mode on YARN, the Spark {{yarn.Client}} > will print out the application report into the logs, to be easily viewed by > users. For example: > {code} > INFO yarn.Client: >client token: Token { kind: YARN_CLIENT_TOKEN, service: } >diagnostics: N/A >ApplicationMaster host: X.X.X.X >ApplicationMaster RPC port: 0 >queue: default >start time: 1602782566027 >final status: UNDEFINED >tracking URL: http://hostname:/proxy/application_/ >user: xkrogen > {code} > Typically, the tracking URL can be used to find the logs of the > ApplicationMaster/driver while the application is running. Later, the Spark > History Server can be used to track this information down, using the > stdout/stderr links on the Executors page. > However, in the situation when the driver crashed _before_ writing out a > history file, the SHS may not be aware of this application, and thus does not > contain links to the driver logs. When this situation arises, it can be > difficult for users to debug further, since they can't easily find their > driver logs. > It is possible to reach the logs by using the {{yarn logs}} commands, but the > average Spark user isn't aware of this and shouldn't have to be. 
> I propose adding, alongside the application report, some additional lines > like: > {code} > Driver Logs (stdout): > http://hostname:8042/node/containerlogs/container_/xkrogen/stdout?start=-4096 > Driver Logs (stderr): > http://hostname:8042/node/containerlogs/container_/xkrogen/stderr?start=-4096 > {code} > With this information available, users can quickly jump to their driver logs, > even if it crashed before the SHS became aware of the application. This has > the additional benefit of providing a quick way to access driver logs, which > often contain useful information, in a single click (instead of navigating > through the Spark UI). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33185) YARN: Print direct links to driver logs alongside application report in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-33185. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30096 [https://github.com/apache/spark/pull/30096] > YARN: Print direct links to driver logs alongside application report in > cluster mode > > > Key: SPARK-33185 > URL: https://issues.apache.org/jira/browse/SPARK-33185 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.1.0 > > > Currently when run in {{cluster}} mode on YARN, the Spark {{yarn.Client}} > will print out the application report into the logs, to be easily viewed by > users. For example: > {code} > INFO yarn.Client: >client token: Token { kind: YARN_CLIENT_TOKEN, service: } >diagnostics: N/A >ApplicationMaster host: X.X.X.X >ApplicationMaster RPC port: 0 >queue: default >start time: 1602782566027 >final status: UNDEFINED >tracking URL: http://hostname:/proxy/application_/ >user: xkrogen > {code} > Typically, the tracking URL can be used to find the logs of the > ApplicationMaster/driver while the application is running. Later, the Spark > History Server can be used to track this information down, using the > stdout/stderr links on the Executors page. > However, in the situation when the driver crashed _before_ writing out a > history file, the SHS may not be aware of this application, and thus does not > contain links to the driver logs. When this situation arises, it can be > difficult for users to debug further, since they can't easily find their > driver logs. > It is possible to reach the logs by using the {{yarn logs}} commands, but the > average Spark user isn't aware of this and shouldn't have to be. 
> I propose adding, alongside the application report, some additional lines > like: > {code} > Driver Logs (stdout): > http://hostname:8042/node/containerlogs/container_/xkrogen/stdout?start=-4096 > Driver Logs (stderr): > http://hostname:8042/node/containerlogs/container_/xkrogen/stderr?start=-4096 > {code} > With this information available, users can quickly jump to their driver logs, > even if it crashed before the SHS became aware of the application. This has > the additional benefit of providing a quick way to access driver logs, which > often contain useful information, in a single click (instead of navigating > through the Spark UI). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
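[Editorial note] The proposed links follow the YARN NodeManager web UI pattern visible in the ticket's sample output. A hedged Python sketch of that URL construction (the function name and the container id below are made up for illustration; the real change lives in Spark's `yarn.Client`, in Scala):

```python
# Build direct NodeManager container-log links of the form shown in
# SPARK-33185's example output. start=-N asks the NodeManager web UI
# for the last N bytes of the log file.

def driver_log_links(nm_host: str, nm_http_port: int, container_id: str,
                     user: str, tail_bytes: int = 4096) -> dict:
    base = (f"http://{nm_host}:{nm_http_port}/node/containerlogs/"
            f"{container_id}/{user}")
    return {stream: f"{base}/{stream}?start=-{tail_bytes}"
            for stream in ("stdout", "stderr")}

# Hypothetical container id, for illustration only.
links = driver_log_links("hostname", 8042,
                         "container_e123_0001_01_000001", "xkrogen")
print("Driver Logs (stdout):", links["stdout"])
print("Driver Logs (stderr):", links["stderr"])
```

Printing these two lines next to the application report gives users a one-click path to the driver logs even when the history server never learns about the application.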
[jira] [Resolved] (SPARK-33353) Cache dependencies for Coursier with new sbt in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-33353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33353. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30259 [https://github.com/apache/spark/pull/30259] > Cache dependencies for Coursier with new sbt in GitHub Actions > -- > > Key: SPARK-33353 > URL: https://issues.apache.org/jira/browse/SPARK-33353 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.0 > > > SPARK-33226 upgraded sbt to 1.4.1. > As of 1.3.0, sbt uses Coursier as the dependency resolver / fetcher. > So let's change the dependency cache configuration for the GitHub Actions job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
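[Editorial note] For readers unfamiliar with the change: Coursier keeps its artifact cache under `~/.cache/coursier` on Linux, so the GitHub Actions cache step needs to point there instead of at Ivy's cache. A sketch of what such a step might look like (the exact paths and keys in the merged PR may differ):

```yaml
- name: Cache Coursier local repository
  uses: actions/cache@v2
  with:
    path: ~/.cache/coursier
    key: coursier-${{ runner.os }}-${{ hashFiles('**/pom.xml', 'project/**') }}
    restore-keys: |
      coursier-${{ runner.os }}-
```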
[jira] [Resolved] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33362. --- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30265 [https://github.com/apache/spark/pull/30265] > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0, 3.0.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33363: Assignee: (was: Apache Spark) > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226867#comment-17226867 ] Apache Spark commented on SPARK-33363: -- User 'akiyamaneko' has created a pull request for this issue: https://github.com/apache/spark/pull/30266 > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33363: Assignee: Apache Spark > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Assignee: Apache Spark >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-33363: Attachment: screenshot.png > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33363) Add prompt information related to the current task when pyspark starts
akiyamaneko created SPARK-33363: --- Summary: Add prompt information related to the current task when pyspark starts Key: SPARK-33363 URL: https://issues.apache.org/jira/browse/SPARK-33363 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.1 Reporter: akiyamaneko The information printed when pyspark starts does not include info such as the current applicationId, application URL, or master type, which is not very convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
[ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226834#comment-17226834 ] Dongjoon Hyun commented on SPARK-33349: --- Hi, [~redsk]. It's possible if we have a test case. For now, we don't. Given that, since 4.12.0 has a breaking API change for that, I'm not sure about the upgrade. When I saw that error, I killed the job. So, I don't know whether it causes a hang or not. > ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed > -- > > Key: SPARK-33349 > URL: https://issues.apache.org/jira/browse/SPARK-33349 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.0 >Reporter: Nicola Bova >Priority: Critical > > I launch my spark application with the > [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] > with the following yaml file: > {code:yaml} > apiVersion: sparkoperator.k8s.io/v1beta2 > kind: SparkApplication > metadata: > name: spark-kafka-streamer-test > namespace: kafka2hdfs > spec: > type: Scala > mode: cluster > image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0 > imagePullPolicy: Always > timeToLiveSeconds: 259200 > mainClass: path.to.my.class.KafkaStreamer > mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar > sparkVersion: 3.0.1 > restartPolicy: > type: Always > sparkConf: > "spark.kafka.consumer.cache.capacity": "8192" > "spark.kubernetes.memoryOverheadFactor": "0.3" > deps: > jars: > - my > - jar > - list > hadoopConfigMap: hdfs-config > driver: > cores: 4 > memory: 12g > labels: > version: 3.0.1 > serviceAccount: default > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > executor: > instances: 4 > cores: 4 > memory: 16g > labels: > version: 3.0.1 > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > {code} > I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart > the watcher 
when we receive a version changed from > k8s"|https://github.com/apache/spark/pull/29533] patch. > This is the driver log: > {code} > 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > ... // my app log, it's a structured streaming app reading from kafka and > writing to hdfs > 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed (this is expected if the application is shutting down.) > io.fabric8.kubernetes.client.KubernetesClientException: too old resource > version: 1574101276 (1574213896) > at > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) > at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) > at > okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) > at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > {code} > The error above appears after roughly 50 minutes. > After the exception above, no more logs are produced and the app hangs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
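[Editorial note] The patch referenced above (apache/spark#29533) restarts the watcher when the API server reports a stale `resourceVersion` ("too old resource version", HTTP 410). A pure-Python sketch of that restart pattern, using a stand-in exception (the real code handles fabric8's `KubernetesClientException` in Scala):

```python
# Restart-on-stale-watch pattern: when a Kubernetes watch dies with
# "too old resource version", re-establish it from the latest version
# instead of letting the snapshot source silently stop.

class ResourceVersionExpired(Exception):
    """Stand-in for the client's 'too old resource version' error."""


def watch_with_restart(start_watch, max_restarts: int = 3):
    """Run start_watch(); on a stale-version error, restart it (i.e.
    re-list and watch from the latest resourceVersion) up to
    max_restarts times before giving up."""
    restarts = 0
    while True:
        try:
            return start_watch()
        except ResourceVersionExpired:
            restarts += 1
            if restarts > max_restarts:
                raise


# Simulate a watch that fails twice with a stale version, then connects.
attempts = []

def flaky_watch():
    attempts.append(1)
    if len(attempts) < 3:
        raise ResourceVersionExpired("too old resource version: 1574101276")
    return "WATCHING"

print(watch_with_restart(flaky_watch))
```

Without a cap on restarts (or a re-list between them), a persistently stale version would loop forever; the real fix also has to avoid exactly the silent hang described in this ticket.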
[jira] [Commented] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226745#comment-17226745 ] Apache Spark commented on SPARK-33362: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30265 > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33362: Assignee: Wenchen Fan (was: Apache Spark) > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33362: Assignee: Apache Spark (was: Wenchen Fan) > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33362) skipSchemaResolution should still require query to be resolved
Wenchen Fan created SPARK-33362: --- Summary: skipSchemaResolution should still require query to be resolved Key: SPARK-33362 URL: https://issues.apache.org/jira/browse/SPARK-33362 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33360: Assignee: Wenchen Fan (was: Apache Spark) > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33360: Assignee: Apache Spark (was: Wenchen Fan) > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226739#comment-17226739 ] Apache Spark commented on SPARK-33360: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30264 > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226738#comment-17226738 ] Apache Spark commented on SPARK-33360: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30264 > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33361) Dataset.observe() functionality does not work with structured streaming
John Wesley created SPARK-33361: --- Summary: Dataset.observe() functionality does not work with structured streaming Key: SPARK-33361 URL: https://issues.apache.org/jira/browse/SPARK-33361 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Environment: Spark on k8s, version 3.0.0 Reporter: John Wesley The Dataset observe() functionality does not work as expected with Spark in cluster mode. Related discussion here: [http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-emit-custom-metrics-to-Prometheus-in-spark-structured-streaming-td38826.html] Using lit() as the aggregation column works as expected; however, sum, count, etc. always return 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33360) simplify DS v2 write resolution
Wenchen Fan created SPARK-33360: --- Summary: simplify DS v2 write resolution Key: SPARK-33360 URL: https://issues.apache.org/jira/browse/SPARK-33360 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33359) foreachBatch sink outputs wrong metrics
John Wesley created SPARK-33359: --- Summary: foreachBatch sink outputs wrong metrics Key: SPARK-33359 URL: https://issues.apache.org/jira/browse/SPARK-33359 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The CRD is ScheduledSparkApplication Reporter: John Wesley I created 2 similar jobs: 1) First job reading from kafka and writing to console sink in append mode 2) Second job reading from kafka and writing to foreachBatch sink (which then writes in parquet format to S3). The metrics logged for the console sink show correct values for numInputRows and numOutputRows, whereas they are wrong for foreachBatch. With foreachBatch: numInputRows is one more than the actual input count, and numOutputRows is always -1. ///Console sink //20/11/05 13:36:21 INFO MicroBatchExecution: Streaming query made progress: { "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", "name" : null, "timestamp" : "2020-11-05T13:36:08.921Z", "batchId" : 0, "numInputRows" : 10, "processedRowsPerSecond" : 0.7759757895553658, "durationMs" : { "addBatch" : 7735, "getBatch" : 152, "latestOffset" : 2037, "queryPlanning" : 1010, "triggerExecution" : 12886, "walCommit" : 938 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[testedr7]]", "startOffset" : null, "endOffset" : { "testedr7" : { "0" : 10 } }, "numInputRows" : 10, "processedRowsPerSecond" : 0.7759757895553658 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", "numOutputRows" : 10 } } ///ForEachBatch Sink //20/11/05 13:43:38 INFO MicroBatchExecution: Streaming query made progress: { "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", "name" : null, "timestamp" : "2020-11-05T13:43:15.421Z", "batchId" : 0, "numInputRows" : 11, "processedRowsPerSecond" : 0.4833252779120348, "durationMs" : { 
"addBatch" : 17689, "getBatch" : 135, "latestOffset" : 2121, "queryPlanning" : 880, "triggerExecution" : 22758, "walCommit" : 876 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[testedr7]]", "startOffset" : null, "endOffset" : { "testedr7" : { "0" : 10 } }, "numInputRows" : 11, "processedRowsPerSecond" : 0.4833252779120348 } ], "sink" : { "description" : "ForeachBatchSink", "numOutputRows" : -1 } } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates ** *{color:#de350b}incorrect{color}* {color}results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color}{color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates ** > *{color:#de350b}incorrect{color}* {color}results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query {color:#de350b}*also*{color} generates *incorrect* {color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates *incorrect* {color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query {color:#de350b}*also*{color} > generates *incorrect* {color} results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates *incorrect* {color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates ** *{color:#de350b}incorrect{color}* {color}results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates *incorrect* > {color} results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color}{color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color} results:{color} * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color} > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates > {color:#de350b}*incorrect*{color}{color} results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color} results:{color} * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color} was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates correct results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query generates incorrect results:{color} * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color} > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates > {color:#de350b}*incorrect*{color} results:{color} > * {color:#172b4d}outDF = spark_session.read.csv("users.csv", > inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as > date)"){color} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226712#comment-17226712 ] Punit Shah commented on SPARK-33327: The correct behaviour of running the query should be: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-02-21, 2013-12-13, 4 or: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-02-21 00:00:00, 2013-12-13 00:00:00, 4 Thanks for asking [~hyukjin.kwon] Now I notice that both imports fail as shown below: The spark_session.read.csv("users.csv", inferSchema=True, header=True) behaves incorrectly like: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-12-13 00:00:00, 2013-03-18 00:00:00, 4 The spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") also behaves incorrectly like so: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-12-13 , 2013-02-21 , 4 > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates correct results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query generates incorrect > results:{color} > * {color:#172b4d}outDF = spark_session.read.csv("users.csv", > inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as > date)"){color} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
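A likely explanation for the varying results above: Spark's first() and last() aggregates are non-deterministic unless the input ordering is fixed, so after a shuffle they depend on which partition's data is merged first. A minimal pure-Python sketch of that effect (not Spark code; `first_last` and the two-partition split are illustrative assumptions):

```python
# Rows for a single user, in file order (FromDate ascending).
dates = ["2013-02-21", "2013-03-18", "2013-06-02", "2013-12-13"]

def first_last(partitions):
    # Mimic a distributed aggregate: each partition contributes its own
    # (first, last) pair, and pairs are merged in whatever order the
    # partitions happen to arrive.
    merged = None
    for part in partitions:
        pair = (part[0], part[-1])
        merged = pair if merged is None else (merged[0], pair[1])
    return merged

# Partitions arriving in file order give the "expected" answer.
print(first_last([dates[:2], dates[2:]]))   # ('2013-02-21', '2013-12-13')

# The same data with partitions swapped gives a different answer.
print(first_last([dates[2:], dates[:2]]))   # ('2013-06-02', '2013-03-18')
```

The data is identical in both calls; only the merge order changes, which is exactly the kind of run-to-run variation the comment above reports for both the timestamp and date imports.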
[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
[ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226665#comment-17226665 ] Nicola Bova commented on SPARK-33349: - [~dongjoon] Would you consider upgrading kubernetes-client to 4.12.0? I'm not 100% sure but this issue seems to completely hang the spark application on k8s, at least this is what I think it's happening in my case. Did the spark app hang when you encountered this problem? > ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed > -- > > Key: SPARK-33349 > URL: https://issues.apache.org/jira/browse/SPARK-33349 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.0 >Reporter: Nicola Bova >Priority: Critical > > I launch my spark application with the > [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] > with the following yaml file: > {code:yaml} > apiVersion: sparkoperator.k8s.io/v1beta2 > kind: SparkApplication > metadata: > name: spark-kafka-streamer-test > namespace: kafka2hdfs > spec: > type: Scala > mode: cluster > image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0 > imagePullPolicy: Always > timeToLiveSeconds: 259200 > mainClass: path.to.my.class.KafkaStreamer > mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar > sparkVersion: 3.0.1 > restartPolicy: > type: Always > sparkConf: > "spark.kafka.consumer.cache.capacity": "8192" > "spark.kubernetes.memoryOverheadFactor": "0.3" > deps: > jars: > - my > - jar > - list > hadoopConfigMap: hdfs-config > driver: > cores: 4 > memory: 12g > labels: > version: 3.0.1 > serviceAccount: default > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > executor: > instances: 4 > cores: 4 > memory: 16g > labels: > version: 3.0.1 > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > {code} > I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the 
["Restart > the watcher when we receive a version changed from > k8s"|https://github.com/apache/spark/pull/29533] patch. > This is the driver log: > {code} > 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > ... // my app log, it's a structured streaming app reading from kafka and > writing to hdfs > 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed (this is expected if the application is shutting down.) > io.fabric8.kubernetes.client.KubernetesClientException: too old resource > version: 1574101276 (1574213896) > at > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) > at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) > at > okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) > at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > {code} > The error above appears after roughly 50 minutes. > After the exception above, no more logs are produced and the app hangs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
[ https://issues.apache.org/jira/browse/SPARK-33358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33358: Assignee: (was: Apache Spark) > Spark SQL CLI command processing loop can't exit when one command fails > -- > > Key: SPARK-33358 > URL: https://issues.apache.org/jira/browse/SPARK-33358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lichuanliang >Priority: Major > > > When submitting a multi-statement SQL script through bin/spark-sql, if one of > the earlier commands fails, the processing loop does not exit; it keeps > executing the following statements and ultimately reports the whole program > as successful. > > {code:java} > for (oneCmd <- commands) { > if (StringUtils.endsWith(oneCmd, "\\")) { > command += StringUtils.chop(oneCmd) + ";" > } else { > command += oneCmd > if (!StringUtils.isBlank(command)) { > val ret = processCmd(command) > command = "" > lastRet = ret > val ignoreErrors = HiveConf.getBoolVar(conf, > HiveConf.ConfVars.CLIIGNOREERRORS) > if (ret != 0 && !ignoreErrors) { > CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) > ret // for loop will not return if one of the commands fail > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
[ https://issues.apache.org/jira/browse/SPARK-33358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33358: Assignee: Apache Spark > Spark SQL CLI command processing loop can't exit when one command fails > -- > > Key: SPARK-33358 > URL: https://issues.apache.org/jira/browse/SPARK-33358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lichuanliang >Assignee: Apache Spark >Priority: Major > > > When submitting a multi-statement SQL script through bin/spark-sql, if one of > the earlier commands fails, the processing loop does not exit; it keeps > executing the following statements and ultimately reports the whole program > as successful. > > {code:java} > for (oneCmd <- commands) { > if (StringUtils.endsWith(oneCmd, "\\")) { > command += StringUtils.chop(oneCmd) + ";" > } else { > command += oneCmd > if (!StringUtils.isBlank(command)) { > val ret = processCmd(command) > command = "" > lastRet = ret > val ignoreErrors = HiveConf.getBoolVar(conf, > HiveConf.ConfVars.CLIIGNOREERRORS) > if (ret != 0 && !ignoreErrors) { > CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) > ret // for loop will not return if one of the commands fail > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
[ https://issues.apache.org/jira/browse/SPARK-33358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226629#comment-17226629 ] Apache Spark commented on SPARK-33358: -- User 'artiship' has created a pull request for this issue: https://github.com/apache/spark/pull/30263 > Spark SQL CLI command processing loop can't exit when one command fails > -- > > Key: SPARK-33358 > URL: https://issues.apache.org/jira/browse/SPARK-33358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lichuanliang >Priority: Major > > > When submitting a multi-statement SQL script through bin/spark-sql, if one of > the earlier commands fails, the processing loop does not exit; it keeps > executing the following statements and ultimately reports the whole program > as successful. > > {code:java} > for (oneCmd <- commands) { > if (StringUtils.endsWith(oneCmd, "\\")) { > command += StringUtils.chop(oneCmd) + ";" > } else { > command += oneCmd > if (!StringUtils.isBlank(command)) { > val ret = processCmd(command) > command = "" > lastRet = ret > val ignoreErrors = HiveConf.getBoolVar(conf, > HiveConf.ConfVars.CLIIGNOREERRORS) > if (ret != 0 && !ignoreErrors) { > CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) > ret // for loop will not return if one of the commands fail > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
Lichuanliang created SPARK-33358: Summary: Spark SQL CLI command processing loop can't exit when one command fails Key: SPARK-33358 URL: https://issues.apache.org/jira/browse/SPARK-33358 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1 Reporter: Lichuanliang When submitting a multi-statement SQL script through bin/spark-sql, if one of the earlier commands fails, the processing loop does not exit; it continues executing the following statements and finally reports the whole program as successful. {code:scala} for (oneCmd <- commands) { if (StringUtils.endsWith(oneCmd, "\\")) { command += StringUtils.chop(oneCmd) + ";" } else { command += oneCmd if (!StringUtils.isBlank(command)) { val ret = processCmd(command) command = "" lastRet = ret val ignoreErrors = HiveConf.getBoolVar(conf, HiveConf.ConfVars.CLIIGNOREERRORS) if (ret != 0 && !ignoreErrors) { CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) ret // bare expression: the for loop will not return if one of the commands fails } } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
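The control-flow bug quoted above is that `ret` on its own line is just a value expression; the Scala for loop discards it and keeps iterating. A minimal sketch of the intended behavior (plain Python, not Spark's actual CLI driver; `process_cmd` is a hypothetical callback):

```python
# Sketch (plain Python, not Spark's CLI driver) of the intended control flow:
# stop at the first failing command unless errors are ignored. In the quoted
# Scala, the bare `ret` expression is discarded and the loop keeps going.

def process_commands(commands, process_cmd, ignore_errors=False):
    """Run commands in order; return the first nonzero exit code, else the
    exit code of the last command."""
    last_ret = 0
    for cmd in commands:
        last_ret = process_cmd(cmd)
        if last_ret != 0 and not ignore_errors:
            return last_ret  # actually leave the loop, unlike the bare `ret`
    return last_ret
```

With commands `["a", "bad", "c"]` where only `"bad"` fails, the sketch returns the failure code and never runs `"c"` unless `ignore_errors=True`.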
[jira] [Assigned] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33356: Assignee: (was: Apache Spark) > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. >Reporter: Lucas Brutschy >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The underlying cause appears to be mutual recursion between > {{PartitionerAwareUnion.getPreferredLocations}} and > {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each > {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is > visited {{O(n!)}} (exponentially many) times. > Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of > {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to > demonstrate the issue in a sufficiently small example. Given a large DAG and > many PartitionerAwareUnions, especially constructed by iterative algorithms, > the problem can become relevant even without "abuse" of the union operation. > The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, > but in the special case of PartitionerAwareUnion, it is still possible. This > may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226611#comment-17226611 ] Apache Spark commented on SPARK-33356: -- User 'lucasbru' has created a pull request for this issue: https://github.com/apache/spark/pull/30262 > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. >Reporter: Lucas Brutschy >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The underlying cause appears to be mutual recursion between > {{PartitionerAwareUnion.getPreferredLocations}} and > {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each > {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is > visited {{O(n!)}} (exponentially many) times. > Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of > {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to > demonstrate the issue in a sufficiently small example. Given a large DAG and > many PartitionerAwareUnions, especially constructed by iterative algorithms, > the problem can become relevant even without "abuse" of the union operation. > The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, > but in the special case of PartitionerAwareUnion, it is still possible. This > may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33356: Assignee: Apache Spark > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. >Reporter: Lucas Brutschy >Assignee: Apache Spark >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The underlying cause appears to be mutual recursion between > {{PartitionerAwareUnion.getPreferredLocations}} and > {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each > {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is > visited {{O(n!)}} (exponentially many) times. > Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of > {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to > demonstrate the issue in a sufficiently small example. Given a large DAG and > many PartitionerAwareUnions, especially constructed by iterative algorithms, > the problem can become relevant even without "abuse" of the union operation. > The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, > but in the special case of PartitionerAwareUnion, it is still possible. This > may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33357) Support SparkLauncher in Kubernetes
hong dongdong created SPARK-33357: - Summary: Support SparkLauncher in Kubernetes Key: SPARK-33357 URL: https://issues.apache.org/jira/browse/SPARK-33357 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.0.1 Reporter: hong dongdong Currently, SparkAppHandle cannot get state reports in Kubernetes (k8s); we can add support for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Brutschy updated SPARK-33356: --- Description: The current implementation of the {{DAGScheduler}} exhibits exponential runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}. A minimal example reproducing the issue: {code:scala} object Example extends App { val partitioner = new HashPartitioner(2) val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]")) val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) val rdd2 = (1 to 30).map(_ => rdd1) val rdd3 = rdd2.reduce(_ union _) rdd3.collect() } {code} The whole app should take around one second to complete, as no actual work is done. However, it takes more time to submit the job than I am willing to wait. The underlying cause appears to be mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is visited {{O(n!)}} (exponentially many) times. Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to demonstrate the issue in a sufficiently small example. Given a large DAG and many PartitionerAwareUnions, especially constructed by iterative algorithms, the problem can become relevant even without "abuse" of the union operation. The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, but in the special case of PartitionerAwareUnion, it is still possible. This may actually be an underlying cause of SPARK-29181. was: The current implementation of the {{DAGScheduler}} exhibits exponential runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}. A minimal example reproducing the issue: {code:java} object Example extends App { val partitioner = new HashPartitioner(2) val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]")) val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) val rdd2 = (1 to 30).map(_ => rdd1) val rdd3 = rdd2.reduce(_ union _) rdd3.collect() } {code} The whole app should take around one second to complete, as no actual work is done. However, it takes more time to submit the job than I am willing to wait. The underlying cause appears to be mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is visited {{O(n!)}} (exponentially many) times. Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to demonstrate the issue in a sufficiently small example. Given a large DAG and many PartitionerAwareUnions, especially constructed by iterative algorithms, the problem can become relevant even without "abuse" of the union operation. The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, but in the special case of PartitionerAwareUnion, it is still possible. This may actually be an underlying cause of SPARK-29181. > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. 
>Reporter: Lucas Brutschy >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The und
[jira] [Created] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
Lucas Brutschy created SPARK-33356: -- Summary: DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion Key: SPARK-33356 URL: https://issues.apache.org/jira/browse/SPARK-33356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1, 2.4.2 Environment: Reproducible locally with 3.0.1, 2.4.2, and latest master. Reporter: Lucas Brutschy The current implementation of the {{DAGScheduler}} exhibits exponential runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}. A minimal example reproducing the issue: {code:scala} object Example extends App { val partitioner = new HashPartitioner(2) val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]")) val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) val rdd2 = (1 to 30).map(_ => rdd1) val rdd3 = rdd2.reduce(_ union _) rdd3.collect() } {code} The whole app should take around one second to complete, as no actual work is done. However, it takes more time to submit the job than I am willing to wait. The underlying cause appears to be mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is visited {{O(n!)}} (exponentially many) times. Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to demonstrate the issue in a sufficiently small example. Given a large DAG and many PartitionerAwareUnions, especially constructed by iterative algorithms, the problem can become relevant even without "abuse" of the union operation. The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, but in the special case of PartitionerAwareUnion, it is still possible. 
This may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
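The blow-up described above can be modeled outside Spark: if every preferred-location lookup re-traverses the parent subgraph with no memoization, shared ancestry is revisited exponentially often, while caching results makes each node cost constant after its first visit. A toy sketch, not Spark's actual DAGScheduler:

```python
# Toy model (not Spark's DAGScheduler) of the exponential traversal: each
# call re-walks both parent branches from scratch, so a chain of N nodes
# triggers 2**(N+1) - 1 visits; memoizing reduces that to N + 1 visits.
from functools import lru_cache

N = 18
naive_visits = 0
memo_visits = 0

def naive_locs(node):
    global naive_visits
    naive_visits += 1                 # every call walks the subgraph again
    if node == 0:
        return frozenset({"host0"})
    return naive_locs(node - 1) | naive_locs(node - 1)

@lru_cache(maxsize=None)
def memo_locs(node):
    global memo_visits
    memo_visits += 1                  # body runs once per distinct node
    if node == 0:
        return frozenset({"host0"})
    return memo_locs(node - 1) | memo_locs(node - 1)

naive_locs(N)
memo_locs(N)
```

With N = 18 the naive traversal makes 2^19 - 1 = 524287 visits while the memoized one makes 19, which mirrors why the 30-union reproduction above never finishes submitting.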
[jira] [Resolved] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-30294. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26935 [https://github.com/apache/spark/pull/26935] > Read-only state store unnecessarily creates and deletes the temp file for > delta file every batch > > > Key: SPARK-30294 > URL: https://issues.apache.org/jira/browse/SPARK-30294 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.1.0 > > > [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155] > {code:scala} > /** Abort all the updates made on this store. This store will not be > usable any more. */ > override def abort(): Unit = { > // This if statement is to ensure that files are deleted only if there > are changes to the > // StateStore. We have two StateStores for each task, one which is used > only for reading, and > // the other used for read+write. We don't want the read-only to delete > state files. > if (state == UPDATING) { > state = ABORTED > cancelDeltaFile(compressedStream, deltaFileStream) > } else { > state = ABORTED > } > logInfo(s"Aborted version $newVersion for $this") > } {code} > Despite the comment, the read-only state store also does the same things to > prepare for writing: it creates the temporary file, initializes output streams for > the file, closes these output streams, and deletes the temporary file. That > is just unnecessary and causes confusion, as according to the log messages two > different instances seem to write to the same delta file. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-30294: Assignee: Jungtaek Lim > Read-only state store unnecessarily creates and deletes the temp file for > delta file every batch > > > Key: SPARK-30294 > URL: https://issues.apache.org/jira/browse/SPARK-30294 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > > [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155] > {code:scala} > /** Abort all the updates made on this store. This store will not be > usable any more. */ > override def abort(): Unit = { > // This if statement is to ensure that files are deleted only if there > are changes to the > // StateStore. We have two StateStores for each task, one which is used > only for reading, and > // the other used for read+write. We don't want the read-only to delete > state files. > if (state == UPDATING) { > state = ABORTED > cancelDeltaFile(compressedStream, deltaFileStream) > } else { > state = ABORTED > } > logInfo(s"Aborted version $newVersion for $this") > } {code} > Despite the comment, the read-only state store also does the same things to > prepare for writing: it creates the temporary file, initializes output streams for > the file, closes these output streams, and deletes the temporary file. That > is just unnecessary and causes confusion, as according to the log messages two > different instances seem to write to the same delta file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
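One way to avoid the cost described above is to create the temporary delta file lazily, so a store that only ever reads never touches the filesystem and its abort() has nothing to delete. A hypothetical sketch in Python (the real code lives in HDFSBackedStateStoreProvider.scala; all names here are illustrative, not Spark's API):

```python
# Hypothetical sketch (not the real HDFSBackedStateStoreProvider) of a
# delta-file lifecycle where the temp file is created lazily: a store used
# only for reads never creates one, so abort() has nothing to clean up.

class StateStoreSketch:
    def __init__(self):
        self.state = "UPDATING"
        self.temp_file_created = False
        self.deleted_temp_files = 0

    def put(self, key, value):
        if not self.temp_file_created:
            # first write: this is where the temp delta file would be opened
            self.temp_file_created = True

    def abort(self):
        if self.temp_file_created:
            # cancel the delta file only if one was actually created
            self.deleted_temp_files += 1
            self.temp_file_created = False
        self.state = "ABORTED"

read_only = StateStoreSketch()   # used only for reads: no put() calls
read_only.abort()                # no temp file ever existed, nothing deleted

writer = StateStoreSketch()
writer.put("k", 1)               # temp file created on first write
writer.abort()                   # temp file cancelled exactly once
```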
[jira] [Commented] (SPARK-33336) provided an invalid fromOffset when spark streaming running
[ https://issues.apache.org/jira/browse/SPARK-33336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226592#comment-17226592 ] Yu Wang commented on SPARK-33336: - [~hyukjin.kwon] thanks for your answer; someone has tried a higher version of Spark: https://issues.apache.org/jira/browse/KAFKA-8124 > provided an invalid fromOffset when spark streaming running > --- > > Key: SPARK-33336 > URL: https://issues.apache.org/jira/browse/SPARK-33336 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.2 >Reporter: Yu Wang >Priority: Major > Attachments: image-2020-11-04-17-24-23-782.png, > image-2020-11-04-17-24-42-544.png > > > !image-2020-11-04-17-24-23-782.png|width=923,height=386! > > > !image-2020-11-04-17-24-42-544.png|width=1199,height=260! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33355) Batch fix compilation warnings about "method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated (since 2.13.0)"
Yang Jie created SPARK-33355: Summary: Batch fix compilation warnings about "method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated (since 2.13.0)" Key: SPARK-33355 URL: https://issues.apache.org/jira/browse/SPARK-33355 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.1.0 Reporter: Yang Jie The compilation warnings are as follows: {code:java} [WARNING] [Warn] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:1711: method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated (since 2.13.0): Implicit conversions from Array to immutable.IndexedSeq are implemented by copying; Use the more efficient non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call {code} This Jira is a placeholder for now; wait until Scala 2.12 is no longer supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32934) Improve the performance for NTH_VALUE and Refactor the OffsetWindowFunction
[ https://issues.apache.org/jira/browse/SPARK-32934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226587#comment-17226587 ] Apache Spark commented on SPARK-32934: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30261 > Improve the performance for NTH_VALUE and Refactor the OffsetWindowFunction > -- > > Key: SPARK-32934 > URL: https://issues.apache.org/jira/browse/SPARK-32934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > Spark SQL supports some window functions like NTH_VALUE. > If we specify a window frame like > {code:java} > UNBOUNDED PRECEDING AND CURRENT ROW > {code} > or > {code:java} > UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > we can eliminate some calculations. > For example, if we execute the SQL shown below: > {code:sql} > SELECT NTH_VALUE(col, 2) OVER (ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING > AND CURRENT ROW) > FROM tab; > {code} > then for row numbers greater than 1 the output is a fixed value, and otherwise > it is null. So we can calculate the value once and only check whether the row > number is less than 2. > The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
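The optimization can be illustrated outside Spark: with a frame of UNBOUNDED PRECEDING AND CURRENT ROW, the frame only grows, so NTH_VALUE(col, n) is null for the first n-1 rows and a constant afterwards. A plain-Python sketch (function names are illustrative, not Spark internals):

```python
# Plain-Python illustration (names are not Spark internals) of NTH_VALUE over
# a growing frame: the naive version rescans the frame for every row, while
# the optimized version computes the nth value exactly once.

def nth_value_running(rows, n):
    """Naive: rescan the growing frame per row, O(len(rows)**2)."""
    out = []
    for i in range(len(rows)):
        frame = rows[: i + 1]
        out.append(frame[n - 1] if len(frame) >= n else None)
    return out

def nth_value_optimized(rows, n):
    """Optimized: null until the frame holds n rows, then a constant."""
    fixed = rows[n - 1] if len(rows) >= n else None
    return [None if i + 1 < n else fixed for i in range(len(rows))]
```

For rows [10, 20, 30, 40] and n = 2, both versions return [None, 20, 20, 20]; the optimized one reads the input a single time.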
[jira] [Updated] (SPARK-33339) PySpark application will hang due to a non-Exception error
[ https://issues.apache.org/jira/browse/SPARK-33339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lrz updated SPARK-33339: Description: When a SystemExit exception occurs during processing, the Python worker exits abnormally, and then the executor task is still waiting to read from the worker's socket, causing it to hang. The SystemExit exception may be caused by an error in the user's code, but Spark should at least throw an error to remind the user rather than get stuck. We can run a simple test to reproduce this case: ``` from pyspark.sql import SparkSession def err(line): raise SystemExit spark = SparkSession.builder.appName("test").getOrCreate() spark.sparkContext.parallelize(range(1,2), 2).map(err).collect() spark.stop() ``` was: In a PySpark application the worker doesn't catch BaseException, so once the worker calls SystemExit because of some error, the application hangs without throwing any exception, and the cause of the failure is not easy to find. For example, running `spark-submit --master yarn-client test.py` will hang without any information. The test.py content: ``` from pyspark.sql import SparkSession def err(line): raise SystemExit spark = SparkSession.builder.appName("test").getOrCreate() spark.sparkContext.parallelize(range(1,2), 2).map(err).collect() spark.stop() ``` Summary: PySpark application will hang due to a non-Exception error (was: pyspark application maybe hangup because of worker exit) > PySpark application will hang due to a non-Exception error > -- > > Key: SPARK-33339 > URL: https://issues.apache.org/jira/browse/SPARK-33339 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.0.1 >Reporter: lrz >Priority: Major > > When a SystemExit exception occurs during processing, the Python worker > exits abnormally, and then the executor task is still waiting to read from the > worker's socket, causing it to hang. 
> The SystemExit exception may be caused by an error in the user's code, but Spark > should at least throw an error to remind the user rather than get stuck. > We can run a simple test to reproduce this case: > ``` > from pyspark.sql import SparkSession > def err(line): > raise SystemExit > spark = SparkSession.builder.appName("test").getOrCreate() > spark.sparkContext.parallelize(range(1,2), 2).map(err).collect() > spark.stop() > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
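The reason SystemExit escapes unnoticed is that it subclasses BaseException rather than Exception, so a worker-side `except Exception` handler never sees it. A small sketch of the difference (illustrative only, not the actual pyspark/worker.py code):

```python
# SystemExit subclasses BaseException, not Exception, so a handler written as
# `except Exception` never sees it and the worker process simply exits.
# Catching BaseException lets the worker report the failure back instead of
# dying silently. (Sketch only; not the actual pyspark/worker.py code.)

assert not issubclass(SystemExit, Exception)
assert issubclass(SystemExit, BaseException)

def run_udf_narrow(fn, value):
    try:
        return ("ok", fn(value))
    except Exception as exc:        # SystemExit slips through this handler
        return ("error", repr(exc))

def run_udf_wide(fn, value):
    try:
        return ("ok", fn(value))
    except BaseException as exc:    # also catches SystemExit
        return ("error", repr(exc))

def bad(_):
    raise SystemExit
```

`run_udf_wide(bad, 1)` returns an error tuple that the worker could report back to the executor; `run_udf_narrow(bad, 1)` lets SystemExit propagate and kill the worker, which is the hang described above.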