[jira] [Assigned] (SPARK-23432) Expose executor memory metrics in the web UI for executors
[ https://issues.apache.org/jira/browse/SPARK-23432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-23432: Assignee: Zhongwei Zhu > Expose executor memory metrics in the web UI for executors > -- > > Key: SPARK-23432 > URL: https://issues.apache.org/jira/browse/SPARK-23432 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Assignee: Zhongwei Zhu >Priority: Major > Fix For: 3.1.0 > > > Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, > and unifiedMemory, etc.) to the executors tab, in the summary and for each > executor. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
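The metrics named in this ticket are per-executor values that also need a summary row. As a rough sketch of one plausible aggregation (Python here for illustration; the metric names come from the ticket, but the `summarize` helper is hypothetical and not Spark's actual UI/REST code, which is Scala):

```python
# Illustrative only: fold per-executor memory metrics into a totals row
# for the Executors tab. `summarize` is a hypothetical helper; the metric
# keys are the ones listed in the ticket.
METRIC_KEYS = ("jvmUsedMemory", "executionMemory", "storageMemory", "unifiedMemory")

def summarize(executors):
    """executors: list of dicts mapping metric name -> value in bytes."""
    return {key: sum(e.get(key, 0) for e in executors) for key in METRIC_KEYS}
```

Whether the summary row sums, averages, or shows peaks per metric is a UI design choice; summing is shown here only as the simplest variant.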
[jira] [Resolved] (SPARK-23432) Expose executor memory metrics in the web UI for executors
[ https://issues.apache.org/jira/browse/SPARK-23432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-23432. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30186 [https://github.com/apache/spark/pull/30186] > Expose executor memory metrics in the web UI for executors > -- > > Key: SPARK-23432 > URL: https://issues.apache.org/jira/browse/SPARK-23432 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Priority: Major > Fix For: 3.1.0 > > > Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, > and unifiedMemory, etc.) to the executors tab, in the summary and for each > executor. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33333) Upgrade Jetty to 9.4.28.v20200408
[ https://issues.apache.org/jira/browse/SPARK-33333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227217#comment-17227217 ]

Apache Spark commented on SPARK-33333:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30276

> Upgrade Jetty to 9.4.28.v20200408
> ---------------------------------
>
> Key: SPARK-33333
> URL: https://issues.apache.org/jira/browse/SPARK-33333
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.4.7, 3.0.1
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 3.0.2
[jira] [Commented] (SPARK-33302) Failed to push down filters through Expand
[ https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227212#comment-17227212 ]

angerszhu commented on SPARK-33302:
-----------------------------------

Working on this.

> Failed to push down filters through Expand
> ------------------------------------------
>
> Key: SPARK-33302
> URL: https://issues.apache.org/jira/browse/SPARK-33302
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.4, 3.0.1, 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) using parquet;
> create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet;
>
> SELECT years,
>        appversion,
>        SUM(uusers) AS users
> FROM   (SELECT Date_trunc('year', dt)    AS years,
>                CASE
>                  WHEN h.pid = 3 THEN 'iOS'
>                  WHEN h.pid = 4 THEN 'Android'
>                  ELSE 'Other'
>                END                       AS viewport,
>                h.vs                      AS appversion,
>                Count(DISTINCT u.uid)     AS uusers,
>                Count(DISTINCT u.suid)    AS srcusers
>         FROM   SPARK_33302_1 u
>                join SPARK_33302_2 h
>                  ON h.uid = u.uid
>         GROUP  BY 1, 2, 3) AS a
> WHERE  viewport = 'iOS'
> GROUP  BY 1, 2
> {code}
> {noformat}
> == Physical Plan ==
> *(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
> +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
>   +- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)])
>     +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)])
>       +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
>         +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)])
>           +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
>             +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241]
>               +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
>                 +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
>                   +- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
>                     +- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12]
>                       +- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight
>                         :- *(2) Project [uid#7, dt#9, suid#10]
>                         :  +- *(2) Filter isnotnull(ui
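The plan above shows the `Filter` on `CASE WHEN ... END#46` sitting directly above the `Expand` instead of being pushed below it. A minimal sketch of the safety condition such a pushdown needs (pure Python with hypothetical names; Spark's actual rule operates on Catalyst expressions): a predicate may move below an `Expand` only when every attribute it references is produced identically by all of the Expand's projections.

```python
# Hypothetical, simplified model of the optimization SPARK-33302 asks for.
# A Filter above an Expand can be pushed below it only when every column
# the predicate references maps to one and the same source expression in
# all of the Expand's projections (None models the NULL literal that a
# grouping-set projection injects).

def pushable_through_expand(predicate_refs, projections):
    """predicate_refs: set of output attribute names the filter uses.
    projections: list of dicts mapping output attr -> source expression."""
    for attr in predicate_refs:
        exprs = {proj.get(attr) for proj in projections}
        # Differing (or NULL-injected) expressions mean filtering below
        # the Expand would change the result, so the pushdown is unsafe.
        if len(exprs) != 1 or None in exprs:
            return False
    return True
```

In the plan above, the `viewport` CASE expression is the same in both `ArrayBuffer` projections, so the `viewport = 'iOS'` filter would qualify, while a filter on `uid` (NULL in one projection) would not.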
[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227201#comment-17227201 ] Apache Spark commented on SPARK-32691: -- User 'huangtianhua' has created a pull request for this issue: https://github.com/apache/spark/pull/30275 > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: zhengruifeng >Priority: Major > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32691: Assignee: Apache Spark (was: zhengruifeng) > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: Apache Spark >Priority: Major > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32691: Assignee: zhengruifeng (was: Apache Spark) > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: zhengruifeng >Priority: Major > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227196#comment-17227196 ]

huangtianhua commented on SPARK-32691:
--------------------------------------

We have found the problem: remote replication takes longer than the default 120-second timeout, so the block is retried on another executor even though the first replication does in fact complete, leaving 3 replicas in total. We then found the process hangs in CryptoRandomFactory.getCryptoRandom(properties), because commons-crypto v1.0.0 does not support aarch64; after switching to v1.1.0 the tests pass and run quickly. So I plan to propose a PR to upgrade to commons-crypto v1.1.0, which supports aarch64:
http://commons.apache.org/proper/commons-crypto/changes-report.html
https://issues.apache.org/jira/browse/CRYPTO-139

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --------------------------------------------------------------
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Tests
> Affects Versions: 3.1.0
> Environment: ARM64
> Reporter: huangtianhua
> Assignee: zhengruifeng
> Priority: Major
> Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, success.log
>
> Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/
> - caching in memory and disk, replicated (encryption = on) (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
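The failure mode described in the comment is a race: the replication RPC exceeds its timeout, the caller retries on another executor, and the slow copy completes anyway, so the block ends up with 3 replicas instead of 2. A toy simulation of that race (all names hypothetical; illustrative only, with the default mirroring the 120-second timeout mentioned above):

```python
# Illustrative-only model of the SPARK-32691 symptom: a replication that
# exceeds the RPC timeout is retried elsewhere even though the slow copy
# eventually lands, producing one replica too many.

def replicate(target_replicas, attempt_secs, timeout_secs=120):
    """attempt_secs: how long each successive replication attempt takes."""
    replicas = 1                      # the original block
    for secs in attempt_secs:
        if replicas >= target_replicas:
            break
        replicas += 1                 # the copy completes either way
        if secs > timeout_secs:
            # The caller saw a timeout and retried on another executor,
            # not knowing the slow copy also succeeded.
            replicas += 1
    return replicas
```

With commons-crypto 1.0.0 falling back to slow (non-native) random generation on aarch64, the first attempt blows past the timeout and the model yields 3 replicas; with fast attempts it yields the expected 2.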
[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227190#comment-17227190 ]

angerszhu commented on SPARK-33326:
-----------------------------------

Could you share which Hive version you are using? This does not seem to reproduce on the current branch.

> Partition Parameters are not updated even after ANALYZE TABLE command
> ---------------------------------------------------------------------
>
> Key: SPARK-33326
> URL: https://issues.apache.org/jira/browse/SPARK-33326
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Daniel Bondor
> Priority: Major
>
> Here are the reproduction steps:
> {code:java}
> scala> spark.sql("CREATE TABLE t (a string, b string) PARTITIONED BY (p string) STORED AS PARQUET")
> Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false)
> ...
> |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, transient_lastDdlTime=1604404640, totalSize=532, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, numRows=0}|
> ...
> |Partition Statistics |1064 bytes, 2 rows |
> ...
> {code}
> My expectation would be that the Partition Parameters should be updated after ANALYZE TABLE.
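For reference, the reporter's expectation is that ANALYZE TABLE writes the computed statistics back into the partition parameters (note `numRows=0` in the parameters while `Partition Statistics` shows 2 rows). A toy model of that bookkeeping (a hypothetical helper, not Spark's catalog code; the parameter keys mirror the Hive-style keys shown in the output above):

```python
# Illustrative only: what "updating Partition Parameters after ANALYZE
# TABLE" means as a data operation. `analyze_partition` is hypothetical.

def analyze_partition(params, computed):
    """params: current Hive-style partition parameters (string values).
    computed: freshly computed statistics. Returns updated parameters
    without mutating the original dict."""
    updated = dict(params)
    updated.update({
        "numRows": computed["numRows"],
        "rawDataSize": computed["rawDataSize"],
    })
    return updated
```

Under this model, after the two inserts in the reproduction the parameters would carry `numRows=2` rather than the stale `numRows=0`.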
[jira] [Resolved] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33364. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30267 [https://github.com/apache/spark/pull/30267] > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.1.0 > > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33364: - Assignee: Terry Kim > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33130) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
[ https://issues.apache.org/jira/browse/SPARK-33130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33130. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30038 [https://github.com/apache/spark/pull/30038] > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MsSqlServer dialect) > --- > > Key: SPARK-33130 > URL: https://issues.apache.org/jira/browse/SPARK-33130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > Fix For: 3.1.0 > > > Override the default SQL strings for: > ALTER TABLE RENAME COLUMN > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MsSQLServer JDBC dialect according to official documentation. > Write MsSqlServer integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
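The dialect mechanism this ticket extends boils down to overriding the SQL strings a generic JDBC dialect emits. A hedged sketch in Python (Spark's real dialects are Scala objects under `org.apache.spark.sql.jdbc`, and the method names below are illustrative; the emitted T-SQL forms, `sp_rename` and the type-restating `ALTER COLUMN`, are genuine SQL Server syntax):

```python
# Sketch only: a base dialect emits generic ALTER TABLE strings and the
# SQL Server dialect overrides them. Method names are hypothetical.

class JdbcDialect:
    def rename_column_sql(self, table, old, new):
        return f"ALTER TABLE {table} RENAME COLUMN {old} TO {new}"

    def update_nullability_sql(self, table, col, data_type, nullable):
        action = "SET NULL" if nullable else "SET NOT NULL"
        return f"ALTER TABLE {table} ALTER COLUMN {col} {action}"

class MsSqlServerDialect(JdbcDialect):
    def rename_column_sql(self, table, old, new):
        # SQL Server has no RENAME COLUMN clause; it uses sp_rename.
        return f"EXEC sp_rename '{table}.{old}', '{new}', 'COLUMN'"

    def update_nullability_sql(self, table, col, data_type, nullable):
        # SQL Server requires restating the column type when changing
        # nullability.
        null_kw = "NULL" if nullable else "NOT NULL"
        return f"ALTER TABLE {table} ALTER COLUMN {col} {data_type} {null_kw}"
```

The integration tests the ticket asks for would then assert that each override produces SQL the target server actually accepts.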
[jira] [Assigned] (SPARK-33130) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
[ https://issues.apache.org/jira/browse/SPARK-33130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33130: --- Assignee: Prashant Sharma > Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and > nullability of columns (MsSqlServer dialect) > --- > > Key: SPARK-33130 > URL: https://issues.apache.org/jira/browse/SPARK-33130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Major > > Override the default SQL strings for: > ALTER TABLE RENAME COLUMN > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following MsSQLServer JDBC dialect according to official documentation. > Write MsSqlServer integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33342) In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
[ https://issues.apache.org/jira/browse/SPARK-33342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-33342.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 30249
[https://github.com/apache/spark/pull/30249]

> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33342
> URL: https://issues.apache.org/jira/browse/SPARK-33342
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 3.0.1
> Environment: spark version: spark3.0.1
> Reporter: akiyamaneko
> Assignee: akiyamaneko
> Priority: Minor
> Fix For: 3.1.0
> Attachments: cannot-jump.gif, display error.png
>
> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail. For example, the thread is *Thread 73* but it is shown as *Thread some(73)*.
[jira] [Assigned] (SPARK-33342) In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
[ https://issues.apache.org/jira/browse/SPARK-33342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang reassigned SPARK-33342:
--------------------------------------
    Assignee: akiyamaneko

> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-33342
> URL: https://issues.apache.org/jira/browse/SPARK-33342
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 3.0.1
> Environment: spark version: spark3.0.1
> Reporter: akiyamaneko
> Assignee: akiyamaneko
> Priority: Minor
> Fix For: 3.1.0
> Attachments: cannot-jump.gif, display error.png
>
> In the threadDump page, when a thread is blocked by another thread, the blocking thread's name and URL are both displayed incorrectly, causing the link navigation to fail. For example, the thread is *Thread 73* but it is shown as *Thread some(73)*.
[jira] [Commented] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227177#comment-17227177 ] Apache Spark commented on SPARK-32860: -- User 'hannahkamundson' has created a pull request for this issue: https://github.com/apache/spark/pull/30274 > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227174#comment-17227174 ] Apache Spark commented on SPARK-33369: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/30273 > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. However, there > is no such trait or flag for indicating a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > When TableProvider.supportsExternalMetadata() is true, Spark will use the > input Dataframe's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33369: Assignee: Apache Spark (was: Gengliang Wang) > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. However, there > is no such trait or flag for indicating a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > When TableProvider.supportsExternalMetadata() is true, Spark will use the > input Dataframe's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33369: Assignee: Gengliang Wang (was: Apache Spark) > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. However, there > is no such trait or flag for indicating a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > When TableProvider.supportsExternalMetadata() is true, Spark will use the > input Dataframe's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
[ https://issues.apache.org/jira/browse/SPARK-33369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-33369: --- Description: For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on DataframeWriter.save(). The inference of table schema/partitioning can be expensive. However, there is no such trait or flag for indicating a V2 source can use the input DataFrame's schema on DataframeWriter.save(). We can resolve the problem by adding a new expected behavior for the method TableProvider.supportsExternalMetadata(): When TableProvider.supportsExternalMetadata() is true, Spark will use the input Dataframe's schema in DataframeWriter.save() and skip schema/partitioning inference. was: For all the v2 data sources which are not FileDataSourceV2, Spark always infer the table schema/partitioning on DataframeWriter.save(). Currently, there is no such trait or flag for indicating a V2 source can use the input DataFrame's schema on DataframeWriter.save(). We can resolve the problem by adding a new expected behavior for the method TableProvider.supportsExternalMetadata(): when TableProvider.supportsExternalMetadata() is true, Spark will use the input Dataframe's schema in DataframeWriter.save() and skip schema/partitioning inference. > Skip schema inference in DataframeWriter.save() if table provider supports > external metadata > > > Key: SPARK-33369 > URL: https://issues.apache.org/jira/browse/SPARK-33369 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For all the v2 data sources which are not FileDataSourceV2, Spark always > infers the table schema/partitioning on DataframeWriter.save(). > The inference of table schema/partitioning can be expensive. 
However, there > is no trait or flag for indicating that a V2 source can use the input > DataFrame's schema on DataframeWriter.save(). We can resolve the problem by > adding a new expected behavior for the method > TableProvider.supportsExternalMetadata(): > when TableProvider.supportsExternalMetadata() is true, Spark will use the > input DataFrame's schema in DataframeWriter.save() and skip > schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33369) Skip schema inference in DataframeWriter.save() if table provider supports external metadata
Gengliang Wang created SPARK-33369: -- Summary: Skip schema inference in DataframeWriter.save() if table provider supports external metadata Key: SPARK-33369 URL: https://issues.apache.org/jira/browse/SPARK-33369 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Gengliang Wang Assignee: Gengliang Wang For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on DataframeWriter.save(). Currently, there is no trait or flag for indicating that a V2 source can use the input DataFrame's schema on DataframeWriter.save(). We can resolve the problem by adding a new expected behavior for the method TableProvider.supportsExternalMetadata(): when TableProvider.supportsExternalMetadata() is true, Spark will use the input DataFrame's schema in DataframeWriter.save() and skip schema/partitioning inference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
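The proposed behavior can be sketched against the existing DataSource V2 API. A minimal, hypothetical provider (the class name and the thrown message are invented for illustration; only TableProvider, supportsExternalMetadata(), inferSchema(), and getTable() come from the Spark connector API):

```scala
import java.util.{Map => JMap}

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class MySourceProvider extends TableProvider {

  // Declaring support for external metadata is what would let Spark pass the
  // input DataFrame's schema straight through on DataframeWriter.save(),
  // skipping the potentially expensive inference step described above.
  override def supportsExternalMetadata(): Boolean = true

  // Only reached when no external schema is supplied.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    throw new UnsupportedOperationException("schema must be provided externally")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: JMap[String, String]): Table = {
    // Build the Table from the externally supplied schema/partitioning.
    ???
  }
}
```

With such a provider, `df.write.format(...).save()` could use `df.schema` directly instead of calling `inferSchema`.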
[jira] [Created] (SPARK-33368) SimplifyConditionals simplifies non-deterministic expressions
Yuming Wang created SPARK-33368: --- Summary: SimplifyConditionals simplifies non-deterministic expressions Key: SPARK-33368 URL: https://issues.apache.org/jira/browse/SPARK-33368 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 2.4.7, 3.1.0 Reporter: Yuming Wang It seems we simplify non-deterministic expressions once they are aliased. For example: {code:sql} CREATE TABLE t(a int, b int, c int) using parquet {code} {code:sql} SELECT CASE WHEN rand(100) > 1 THEN 1 WHEN rand(100) + 1 > 1000 THEN 1 WHEN rand(100) + 2 < 100 THEN 1 ELSE 1 END AS x FROM t {code} The plan is: {noformat} == Physical Plan == *(1) Project [CASE WHEN (rand(100) > 1.0) THEN 1 WHEN ((rand(100) + 1.0) > 1000.0) THEN 1 WHEN ((rand(100) + 2.0) < 100.0) THEN 1 ELSE 1 END AS x#6] +- *(1) ColumnarToRow +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> {noformat} {code:sql} SELECT CASE WHEN rd > 1 THEN 1 WHEN rd + 1 > 1000 THEN 1 WHEN rd + 2 < 100 THEN 1 ELSE 1 END AS x FROM (SELECT *, rand(100) as rd FROM t) t1 {code} The plan is: {noformat} == Physical Plan == *(1) Project [1 AS x#1] +- *(1) ColumnarToRow +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
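One way to read the report above: once rand(100) hides behind the alias rd, the optimizer folds the whole CASE into the constant 1 even though each branch condition is non-deterministic. A rough sketch of the kind of determinism guard the rule needs (simplified and hypothetical, not the actual SimplifyConditionals implementation):

```scala
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression}

object SimplifyConditionalsSketch {
  def simplify(e: Expression): Expression = e match {
    // Collapsing CASE WHEN branches is only safe when every branch
    // predicate is deterministic: rand(100) > 1 may evaluate differently
    // per row, so branches must not be merged even if all values agree.
    case cw @ CaseWhen(branches, _) if branches.forall(_._1.deterministic) =>
      // ... perform the usual branch elimination here ...
      cw
    case other => other // leave non-deterministic conditionals untouched
  }
}
```

The second query plan in the report (Project [1 AS x#1]) shows exactly the folding that such a guard would prevent.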
[jira] [Assigned] (SPARK-33367) Move the table commands from ddl.scala to table.scala
[ https://issues.apache.org/jira/browse/SPARK-33367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33367: Assignee: Apache Spark > Move the table commands from ddl.scala to table.scala > - > > Key: SPARK-33367 > URL: https://issues.apache.org/jira/browse/SPARK-33367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > The table commands can be moved from `ddl.scala` to `table.scala` per the > following: > https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 > This change will result in table-related commands in the same file > `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33367) Move the table commands from ddl.scala to table.scala
[ https://issues.apache.org/jira/browse/SPARK-33367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227169#comment-17227169 ] Apache Spark commented on SPARK-33367: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30271 > Move the table commands from ddl.scala to table.scala > - > > Key: SPARK-33367 > URL: https://issues.apache.org/jira/browse/SPARK-33367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > The table commands can be moved from `ddl.scala` to `table.scala` per the > following: > https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 > This change will result in table-related commands in the same file > `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33367) Move the table commands from ddl.scala to table.scala
[ https://issues.apache.org/jira/browse/SPARK-33367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33367: Assignee: (was: Apache Spark) > Move the table commands from ddl.scala to table.scala > - > > Key: SPARK-33367 > URL: https://issues.apache.org/jira/browse/SPARK-33367 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > The table commands can be moved from `ddl.scala` to `table.scala` per the > following: > https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 > This change will result in table-related commands in the same file > `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33367) Move the table commands from ddl.scala to table.scala
Terry Kim created SPARK-33367: - Summary: Move the table commands from ddl.scala to table.scala Key: SPARK-33367 URL: https://issues.apache.org/jira/browse/SPARK-33367 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim The table commands can be moved from `ddl.scala` to `table.scala` per the following: https://github.com/apache/spark/blob/4941b7ae18d4081233953cc11328645d0b4cf208/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L137 This change will result in table-related commands in the same file `table.scala`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32860: Assignee: (was: Apache Spark) > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227155#comment-17227155 ] Apache Spark commented on SPARK-32860: -- User 'hannahkamundson' has created a pull request for this issue: https://github.com/apache/spark/pull/30272 > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported
[ https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32860: Assignee: Apache Spark > Encoders::bean doc incorrectly states maps are not supported > > > Key: SPARK-32860 > URL: https://issues.apache.org/jira/browse/SPARK-32860 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.6, 3.0.1, 3.1.0 >Reporter: Dan Ziemba >Assignee: Apache Spark >Priority: Trivial > Labels: starter > > The documentation for the bean method in the Encoders class currently states: > {quote}collection types: only array and java.util.List currently, map support > is in progress > {quote} > But map support appears to work properly and has been available since 2.1.0 > according to SPARK-16706. Documentation should be updated to match what is / > is not actually supported (Set, Queue, etc?). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
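The ticket's claim that map support already works can be checked with a small bean. The Settings class below is invented for the example; only Encoders.bean and SparkSession are the real APIs:

```scala
import java.util.{Map => JMap}

import org.apache.spark.sql.{Encoders, SparkSession}

// A Java-style bean with a map-typed property, the case the doc says
// is "in progress".
class Settings extends Serializable {
  private var props: JMap[String, String] = _
  def getProps: JMap[String, String] = props
  def setProps(p: JMap[String, String]): Unit = { props = p }
}

object BeanMapExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    // Per SPARK-16706 this should succeed on 2.1.0+: the map property is
    // encoded as a map-typed column rather than being rejected.
    val enc = Encoders.bean(classOf[Settings])
    println(enc.schema) // expect a map<string,string> field named props
    spark.stop()
  }
}
```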
[jira] [Commented] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227127#comment-17227127 ] Apache Spark commented on SPARK-33366: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30270 > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33366: Assignee: (was: Apache Spark) > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227126#comment-17227126 ] Apache Spark commented on SPARK-33366: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30270 > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33366) Migrate LOAD DATA to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33366: Assignee: Apache Spark > Migrate LOAD DATA to new resolution framework > - > > Key: SPARK-33366 > URL: https://issues.apache.org/jira/browse/SPARK-33366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227112#comment-17227112 ] angerszhu commented on SPARK-33326: --- working on this. > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33366) Migrate LOAD DATA to new resolution framework
Terry Kim created SPARK-33366: - Summary: Migrate LOAD DATA to new resolution framework Key: SPARK-33366 URL: https://issues.apache.org/jira/browse/SPARK-33366 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim Migrate LOAD DATA to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-33326: -- Comment: was deleted (was: Working. on this. [~dabondor] Your hive version is 2.3? Since seems Hive-1.2.1 don't have this partition parameter) > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227107#comment-17227107 ] angerszhu edited comment on SPARK-33326 at 11/6/20, 2:05 AM: - Working. on this. [~dabondor] Your hive version is 2.3? Since seems Hive-1.2.1 don't have this partition parameter was (Author: angerszhuuu): Working. on this > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33326) Partition Parameters are not updated even after ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-33326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227107#comment-17227107 ] angerszhu commented on SPARK-33326: --- Working on this. > Partition Parameters are not updated even after ANALYZE TABLE command > - > > Key: SPARK-33326 > URL: https://issues.apache.org/jira/browse/SPARK-33326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Daniel Bondor >Priority: Major > > Here are the reproduction steps: > {code:java} > scala> spark.sql("CREATE TABLE t (a string,b string) PARTITIONED BY (p > string) STORED AS PARQUET") > Hive Session ID = d44e21ee-2d5c-48ab-91bf-26cb25775486 > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('aaa', 'bbb')") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("INSERT INTO t PARTITION(p='p1') VALUES ('ccc', 'ddd')") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("ANALYZE TABLE t PARTITION(p='p1') COMPUTE STATISTICS") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESCRIBE FORMATTED t PARTITION (p='p1')").show(50, false) > ... > |Partition Parameters |{rawDataSize=0, numFiles=1, numFilesErasureCoded=0, > transient_lastDdlTime=1604404640, totalSize=532, > COLUMN_STATS_ACCURATE={"BASIC_STATS":"true","COLUMN_STATS":{"a":"true","b":"true"}}, > numRows=0}| | > ... > |Partition Statistics |1064 bytes, 2 rows | | > ... > {code} > My expectation would be that the Partition Parameters should be updated after > ANALYZE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33365: - Assignee: William Hyun > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33365. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30268 [https://github.com/apache/spark/pull/30268] > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33360. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30264 [https://github.com/apache/spark/pull/30264] > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33361) Dataset.observe() functionality does not work with structured streaming
[ https://issues.apache.org/jira/browse/SPARK-33361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227078#comment-17227078 ] Jungtaek Lim commented on SPARK-33361: -- cc. [~hvanhovell] > Dataset.observe() functionality does not work with structured streaming > --- > > Key: SPARK-33361 > URL: https://issues.apache.org/jira/browse/SPARK-33361 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on k8s, version 3.0.0 >Reporter: John Wesley >Priority: Major > > The dataset observe() functionality does not work as expected with spark in > cluster mode. > Related discussion here: > [http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-emit-custom-metrics-to-Prometheus-in-spark-structured-streaming-td38826.html] > Using lit() as the aggregation column goes through well. However sum, count > etc returns 0 all the time. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
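For context, the streaming usage of Dataset.observe that the ticket describes looks roughly like the sketch below. The metric name "my_metrics", the column names, and the rate source are invented for illustration; observe(), StreamingQueryListener, and observedMetrics are the real Spark 3.0 APIs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object ObserveExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").getOrCreate()

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(e: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(e: QueryProgressEvent): Unit = {
        // The reported symptom: in cluster mode these aggregates came back
        // as 0, while a lit() column passed through correctly.
        val metrics = e.progress.observedMetrics.get("my_metrics")
        if (metrics != null) println(s"rows=${metrics.getAs[Long]("rows")}")
      }
    })

    val stream = spark.readStream.format("rate").load()
      .observe("my_metrics", count(lit(1)).as("rows"), sum("value").as("total"))

    stream.writeStream.format("console").start().awaitTermination()
  }
}
```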
[jira] [Commented] (SPARK-33359) foreachBatch sink outputs wrong metrics
[ https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227064#comment-17227064 ] John Wesley commented on SPARK-33359: - Thanks for the clarification. How about numInputRows? > foreachBatch sink outputs wrong metrics > --- > > Key: SPARK-33359 > URL: https://issues.apache.org/jira/browse/SPARK-33359 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The > CRD is ScheduledSparkApplication >Reporter: John Wesley >Priority: Minor > > I created 2 similar jobs, > 1) First job reading from kafka and writing to console sink in append mode > 2) Second job reading from kafka and writing to foreachBatch sink (which then > writes in parquet format to S3). > The metrics in the log for console shows correct values for numInputRows and > numOutputRows whereas they are wrong for foreachBatch. > With foreachBatch: > numInputRows is +1 more than what is actually present > numOutputRows is always -1. 
> ///Console sink > //20/11/05 13:36:21 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", > "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", > "name" : null, > "timestamp" : "2020-11-05T13:36:08.921Z", > "batchId" : 0, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658, > "durationMs" : { > "addBatch" : 7735, > "getBatch" : 152, > "latestOffset" : 2037, > "queryPlanning" : 1010, > "triggerExecution" : 12886, > "walCommit" : 938 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", > "numOutputRows" : 10 > } > } > ///ForEachBatch Sink > //20/11/05 13:43:38 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", > "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", > "name" : null, > "timestamp" : "2020-11-05T13:43:15.421Z", > "batchId" : 0, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348, > "durationMs" : { > "addBatch" : 17689, > "getBatch" : 135, > "latestOffset" : 2121, > "queryPlanning" : 880, > "triggerExecution" : 22758, > "walCommit" : 876 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348 > } ], > "sink" : { > "description" : "ForeachBatchSink", > "numOutputRows" : -1 > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33359) foreachBatch sink outputs wrong metrics
[ https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-33359. -- Resolution: Not A Problem > foreachBatch sink outputs wrong metrics > --- > > Key: SPARK-33359 > URL: https://issues.apache.org/jira/browse/SPARK-33359 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The > CRD is ScheduledSparkApplication >Reporter: John Wesley >Priority: Minor > > I created 2 similar jobs, > 1) First job reading from kafka and writing to console sink in append mode > 2) Second job reading from kafka and writing to foreachBatch sink (which then > writes in parquet format to S3). > The metrics in the log for console shows correct values for numInputRows and > numOutputRows whereas they are wrong for foreachBatch. > With foreachBatch: > numInputRows is +1 more than what is actually present > numOutputRows is always -1. 
> ///Console sink > //20/11/05 13:36:21 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", > "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", > "name" : null, > "timestamp" : "2020-11-05T13:36:08.921Z", > "batchId" : 0, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658, > "durationMs" : { > "addBatch" : 7735, > "getBatch" : 152, > "latestOffset" : 2037, > "queryPlanning" : 1010, > "triggerExecution" : 12886, > "walCommit" : 938 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", > "numOutputRows" : 10 > } > } > ///ForEachBatch Sink > //20/11/05 13:43:38 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", > "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", > "name" : null, > "timestamp" : "2020-11-05T13:43:15.421Z", > "batchId" : 0, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348, > "durationMs" : { > "addBatch" : 17689, > "getBatch" : 135, > "latestOffset" : 2121, > "queryPlanning" : 880, > "triggerExecution" : 22758, > "walCommit" : 876 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348 > } ], > "sink" : { > "description" : "ForeachBatchSink", > "numOutputRows" : -1 > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33359) foreachBatch sink outputs wrong metrics
[ https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227029#comment-17227029 ] Jungtaek Lim commented on SPARK-33359: -- That's by design. The metric is available only for V2 sink. The interface of V1 sink doesn't open the possibility to count the number of output rows, as it can run arbitrary DataFrame operations not just writing, just like we do with foreachBatch. -1 means "unknown". Probably better to document this to avoid confusion, but this by itself is not a bug. Let me close this for now. Thanks. > foreachBatch sink outputs wrong metrics > --- > > Key: SPARK-33359 > URL: https://issues.apache.org/jira/browse/SPARK-33359 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The > CRD is ScheduledSparkApplication >Reporter: John Wesley >Priority: Minor > > I created 2 similar jobs, > 1) First job reading from kafka and writing to console sink in append mode > 2) Second job reading from kafka and writing to foreachBatch sink (which then > writes in parquet format to S3). > The metrics in the log for console shows correct values for numInputRows and > numOutputRows whereas they are wrong for foreachBatch. > With foreachBatch: > numInputRows is +1 more than what is actually present > numOutputRows is always -1. 
> ///Console sink > //20/11/05 13:36:21 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", > "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", > "name" : null, > "timestamp" : "2020-11-05T13:36:08.921Z", > "batchId" : 0, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658, > "durationMs" : { > "addBatch" : 7735, > "getBatch" : 152, > "latestOffset" : 2037, > "queryPlanning" : 1010, > "triggerExecution" : 12886, > "walCommit" : 938 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 10, > "processedRowsPerSecond" : 0.7759757895553658 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", > "numOutputRows" : 10 > } > } > ///ForEachBatch Sink > //20/11/05 13:43:38 INFO > MicroBatchExecution: Streaming query made progress: { > "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", > "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", > "name" : null, > "timestamp" : "2020-11-05T13:43:15.421Z", > "batchId" : 0, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348, > "durationMs" : { > "addBatch" : 17689, > "getBatch" : 135, > "latestOffset" : 2121, > "queryPlanning" : 880, > "triggerExecution" : 22758, > "walCommit" : 876 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[testedr7]]", > "startOffset" : null, > "endOffset" : { > "testedr7" : { > "0" : 10 > } > }, > "numInputRows" : 11, > "processedRowsPerSecond" : 0.4833252779120348 > } ], > "sink" : { > "description" : "ForeachBatchSink", > "numOutputRows" : -1 > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
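[Editorial note] As the comment above explains, `numOutputRows` is only populated for V2 sinks; a V1-style sink such as `ForeachBatchSink` reports -1, meaning "unknown". If you post-process the `StreamingQueryProgress` JSON quoted in the logs above, it is worth normalizing that sentinel. The following is a small stand-alone Python sketch (the helper name is ours, not Spark's); only the JSON field names come from the logs:

```python
import json

# Interpret sink metrics from a StreamingQueryProgress JSON like the ones
# quoted above. Per the comment on SPARK-33359, numOutputRows is populated
# only for V2 sinks; ForeachBatchSink reports -1, which means "unknown".

def sink_output_rows(progress_json: str):
    """Return the sink's numOutputRows, or None when the sink reports a
    negative value (i.e. the metric is unknown, as with ForeachBatchSink)."""
    progress = json.loads(progress_json)
    rows = progress["sink"]["numOutputRows"]
    return None if rows < 0 else rows

# Minimal sink fragments, shaped like the progress logs above.
console = '{"sink": {"description": "ConsoleTable", "numOutputRows": 10}}'
foreach = '{"sink": {"description": "ForeachBatchSink", "numOutputRows": -1}}'
print(sink_output_rows(console))  # 10
print(sink_output_rows(foreach))  # None
```

Treating -1 as `None` rather than as a real count avoids the confusion that prompted this ticket.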
[jira] [Commented] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227027#comment-17227027 ] Apache Spark commented on SPARK-33365: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30268 > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227026#comment-17227026 ] Apache Spark commented on SPARK-33365: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30268 > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33365: Assignee: Apache Spark > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33365) Update SBT to 1.4.2
[ https://issues.apache.org/jira/browse/SPARK-33365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33365: Assignee: (was: Apache Spark) > Update SBT to 1.4.2 > --- > > Key: SPARK-33365 > URL: https://issues.apache.org/jira/browse/SPARK-33365 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33365) Update SBT to 1.4.2
William Hyun created SPARK-33365: Summary: Update SBT to 1.4.2 Key: SPARK-33365 URL: https://issues.apache.org/jira/browse/SPARK-33365 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
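[Editorial note] For context on what this bump involves: sbt launches the version pinned in `project/build.properties`, so the change proposed here amounts to a one-line edit of the form:

```properties
# project/build.properties
sbt.version=1.4.2
```

Since sbt 1.3.0 the launched build also resolves dependencies through Coursier, which is what SPARK-33353 (below) adjusts the CI caches for.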
[jira] [Commented] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226950#comment-17226950 ] Apache Spark commented on SPARK-33364: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30267 > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33364: Assignee: Apache Spark > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33364) Expose purge option in TableCatalog.dropTable
[ https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33364: Assignee: (was: Apache Spark) > Expose purge option in TableCatalog.dropTable > - > > Key: SPARK-33364 > URL: https://issues.apache.org/jira/browse/SPARK-33364 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33364) Expose purge option in TableCatalog.dropTable
Terry Kim created SPARK-33364: - Summary: Expose purge option in TableCatalog.dropTable Key: SPARK-33364 URL: https://issues.apache.org/jira/browse/SPARK-33364 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim TableCatalog.dropTable currently does not support the purge option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
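[Editorial note] To make the proposal concrete, here is an illustrative Python sketch (Spark's actual `TableCatalog` is a Java interface, and the real signature added by the PR may differ) of how a purge flag can be exposed with a backward-compatible default that rejects purging, so existing catalog implementations keep compiling while purge-capable ones override it:

```python
# Illustrative only: mirrors the idea of adding a purge option to a
# catalog's dropTable with a safe default, not Spark's actual API.

class TableCatalog:
    def drop_table(self, ident: str, purge: bool = False) -> bool:
        if purge:
            # Catalogs that cannot bypass trash/retention refuse by default.
            raise NotImplementedError(f"purge is not supported for {ident}")
        return self._do_drop(ident)

    def _do_drop(self, ident: str) -> bool:
        raise NotImplementedError


class InMemoryCatalog(TableCatalog):
    """Toy catalog: drops succeed, but purge is left unsupported."""

    def __init__(self):
        self.tables = {"db.t": object()}

    def _do_drop(self, ident: str) -> bool:
        return self.tables.pop(ident, None) is not None
```

A default that raises keeps the change source-compatible: callers can request a purge, and catalogs opt in explicitly.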
[jira] [Assigned] (SPARK-33185) YARN: Print direct links to driver logs alongside application report in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-33185: --- Assignee: Erik Krogen > YARN: Print direct links to driver logs alongside application report in > cluster mode > > > Key: SPARK-33185 > URL: https://issues.apache.org/jira/browse/SPARK-33185 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > Currently when run in {{cluster}} mode on YARN, the Spark {{yarn.Client}} > will print out the application report into the logs, to be easily viewed by > users. For example: > {code} > INFO yarn.Client: >client token: Token { kind: YARN_CLIENT_TOKEN, service: } >diagnostics: N/A >ApplicationMaster host: X.X.X.X >ApplicationMaster RPC port: 0 >queue: default >start time: 1602782566027 >final status: UNDEFINED >tracking URL: http://hostname:/proxy/application_/ >user: xkrogen > {code} > Typically, the tracking URL can be used to find the logs of the > ApplicationMaster/driver while the application is running. Later, the Spark > History Server can be used to track this information down, using the > stdout/stderr links on the Executors page. > However, in the situation when the driver crashed _before_ writing out a > history file, the SHS may not be aware of this application, and thus does not > contain links to the driver logs. When this situation arises, it can be > difficult for users to debug further, since they can't easily find their > driver logs. > It is possible to reach the logs by using the {{yarn logs}} commands, but the > average Spark user isn't aware of this and shouldn't have to be. 
> I propose adding, alongside the application report, some additional lines > like: > {code} > Driver Logs (stdout): > http://hostname:8042/node/containerlogs/container_/xkrogen/stdout?start=-4096 > Driver Logs (stderr): > http://hostname:8042/node/containerlogs/container_/xkrogen/stderr?start=-4096 > {code} > With this information available, users can quickly jump to their driver logs, > even if it crashed before the SHS became aware of the application. This has > the additional benefit of providing a quick way to access driver logs, which > often contain useful information, in a single click (instead of navigating > through the Spark UI). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33185) YARN: Print direct links to driver logs alongside application report in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-33185. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30096 [https://github.com/apache/spark/pull/30096] > YARN: Print direct links to driver logs alongside application report in > cluster mode > > > Key: SPARK-33185 > URL: https://issues.apache.org/jira/browse/SPARK-33185 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.1.0 > > > Currently when run in {{cluster}} mode on YARN, the Spark {{yarn.Client}} > will print out the application report into the logs, to be easily viewed by > users. For example: > {code} > INFO yarn.Client: >client token: Token { kind: YARN_CLIENT_TOKEN, service: } >diagnostics: N/A >ApplicationMaster host: X.X.X.X >ApplicationMaster RPC port: 0 >queue: default >start time: 1602782566027 >final status: UNDEFINED >tracking URL: http://hostname:/proxy/application_/ >user: xkrogen > {code} > Typically, the tracking URL can be used to find the logs of the > ApplicationMaster/driver while the application is running. Later, the Spark > History Server can be used to track this information down, using the > stdout/stderr links on the Executors page. > However, in the situation when the driver crashed _before_ writing out a > history file, the SHS may not be aware of this application, and thus does not > contain links to the driver logs. When this situation arises, it can be > difficult for users to debug further, since they can't easily find their > driver logs. > It is possible to reach the logs by using the {{yarn logs}} commands, but the > average Spark user isn't aware of this and shouldn't have to be. 
> I propose adding, alongside the application report, some additional lines > like: > {code} > Driver Logs (stdout): > http://hostname:8042/node/containerlogs/container_/xkrogen/stdout?start=-4096 > Driver Logs (stderr): > http://hostname:8042/node/containerlogs/container_/xkrogen/stderr?start=-4096 > {code} > With this information available, users can quickly jump to their driver logs, > even if it crashed before the SHS became aware of the application. This has > the additional benefit of providing a quick way to access driver logs, which > often contain useful information, in a single click (instead of navigating > through the Spark UI). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
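[Editorial note] The proposed links follow the YARN NodeManager web UI pattern visible in the ticket's sample output. A hedged Python sketch of that URL construction (the function name and the container id below are made up for illustration; the real change lives in Spark's `yarn.Client`, in Scala):

```python
# Build direct NodeManager container-log links of the form shown in
# SPARK-33185's example output. start=-N asks the NodeManager web UI
# for the last N bytes of the log file.

def driver_log_links(nm_host: str, nm_http_port: int, container_id: str,
                     user: str, tail_bytes: int = 4096) -> dict:
    base = (f"http://{nm_host}:{nm_http_port}/node/containerlogs/"
            f"{container_id}/{user}")
    return {stream: f"{base}/{stream}?start=-{tail_bytes}"
            for stream in ("stdout", "stderr")}

# Hypothetical container id, for illustration only.
links = driver_log_links("hostname", 8042,
                         "container_e123_0001_01_000001", "xkrogen")
print("Driver Logs (stdout):", links["stdout"])
print("Driver Logs (stderr):", links["stderr"])
```

Printing these two lines next to the application report gives users a one-click path to the driver logs even when the history server never learns about the application.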
[jira] [Resolved] (SPARK-33353) Cache dependencies for Coursier with new sbt in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-33353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33353. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30259 [https://github.com/apache/spark/pull/30259] > Cache dependencies for Coursier with new sbt in GitHub Actions > -- > > Key: SPARK-33353 > URL: https://issues.apache.org/jira/browse/SPARK-33353 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.0 > > > SPARK-33226 upgraded sbt to 1.4.1. > As of 1.3.0, sbt uses Coursier as the dependency resolver / fetcher. > So let's change the dependency cache configuration for the GitHub Actions job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
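[Editorial note] For readers unfamiliar with the change: Coursier keeps its artifact cache under `~/.cache/coursier` on Linux, so the GitHub Actions cache step needs to point there instead of at Ivy's cache. A sketch of what such a step might look like (the exact paths and keys in the merged PR may differ):

```yaml
- name: Cache Coursier local repository
  uses: actions/cache@v2
  with:
    path: ~/.cache/coursier
    key: coursier-${{ runner.os }}-${{ hashFiles('**/pom.xml', 'project/**') }}
    restore-keys: |
      coursier-${{ runner.os }}-
```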
[jira] [Resolved] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33362. --- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30265 [https://github.com/apache/spark/pull/30265] > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0, 3.0.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33363: Assignee: (was: Apache Spark) > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226867#comment-17226867 ] Apache Spark commented on SPARK-33363: -- User 'akiyamaneko' has created a pull request for this issue: https://github.com/apache/spark/pull/30266 > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33363: Assignee: Apache Spark > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Assignee: Apache Spark >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33363) Add prompt information related to the current task when pyspark starts
[ https://issues.apache.org/jira/browse/SPARK-33363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-33363: Attachment: screenshot.png > Add prompt information related to the current task when pyspark starts > -- > > Key: SPARK-33363 > URL: https://issues.apache.org/jira/browse/SPARK-33363 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: screenshot.png > > > The information printed when pyspark starts does not prompt info such as > :current applicationId, application URL, master type, and it is not very > convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33363) Add prompt information related to the current task when pyspark starts
akiyamaneko created SPARK-33363: --- Summary: Add prompt information related to the current task when pyspark starts Key: SPARK-33363 URL: https://issues.apache.org/jira/browse/SPARK-33363 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.1 Reporter: akiyamaneko The information printed when pyspark starts does not include info such as the current applicationId, application URL, or master type, which is not very convenient -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
[ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226834#comment-17226834 ] Dongjoon Hyun commented on SPARK-33349: --- Hi, [~redsk]. It's possible if we have a test case. For now, we don't. Given that, since 4.12.0 has a breaking API change for that, I'm not sure about the upgrade. When I saw that error, I killed the job. So, I don't know whether it causes a hang or not. > ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed > -- > > Key: SPARK-33349 > URL: https://issues.apache.org/jira/browse/SPARK-33349 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.0 >Reporter: Nicola Bova >Priority: Critical > > I launch my spark application with the > [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] > with the following yaml file: > {code:yaml} > apiVersion: sparkoperator.k8s.io/v1beta2 > kind: SparkApplication > metadata: > name: spark-kafka-streamer-test > namespace: kafka2hdfs > spec: > type: Scala > mode: cluster > image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0 > imagePullPolicy: Always > timeToLiveSeconds: 259200 > mainClass: path.to.my.class.KafkaStreamer > mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar > sparkVersion: 3.0.1 > restartPolicy: > type: Always > sparkConf: > "spark.kafka.consumer.cache.capacity": "8192" > "spark.kubernetes.memoryOverheadFactor": "0.3" > deps: > jars: > - my > - jar > - list > hadoopConfigMap: hdfs-config > driver: > cores: 4 > memory: 12g > labels: > version: 3.0.1 > serviceAccount: default > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > executor: > instances: 4 > cores: 4 > memory: 16g > labels: > version: 3.0.1 > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > {code} > I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart > the watcher 
when we receive a version changed from > k8s"|https://github.com/apache/spark/pull/29533] patch. > This is the driver log: > {code} > 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > ... // my app log, it's a structured streaming app reading from kafka and > writing to hdfs > 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed (this is expected if the application is shutting down.) > io.fabric8.kubernetes.client.KubernetesClientException: too old resource > version: 1574101276 (1574213896) > at > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) > at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) > at > okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) > at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > {code} > The error above appears after roughly 50 minutes. > After the exception above, no more logs are produced and the app hangs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
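[Editorial note] The patch referenced above (apache/spark#29533) restarts the watcher when the API server reports a stale `resourceVersion` ("too old resource version", HTTP 410). A pure-Python sketch of that restart pattern, using a stand-in exception (the real code handles fabric8's `KubernetesClientException` in Scala):

```python
# Restart-on-stale-watch pattern: when a Kubernetes watch dies with
# "too old resource version", re-establish it from the latest version
# instead of letting the snapshot source silently stop.

class ResourceVersionExpired(Exception):
    """Stand-in for the client's 'too old resource version' error."""


def watch_with_restart(start_watch, max_restarts: int = 3):
    """Run start_watch(); on a stale-version error, restart it (i.e.
    re-list and watch from the latest resourceVersion) up to
    max_restarts times before giving up."""
    restarts = 0
    while True:
        try:
            return start_watch()
        except ResourceVersionExpired:
            restarts += 1
            if restarts > max_restarts:
                raise


# Simulate a watch that fails twice with a stale version, then connects.
attempts = []

def flaky_watch():
    attempts.append(1)
    if len(attempts) < 3:
        raise ResourceVersionExpired("too old resource version: 1574101276")
    return "WATCHING"

print(watch_with_restart(flaky_watch))
```

Without a cap on restarts (or a re-list between them), a persistently stale version would loop forever; the real fix also has to avoid exactly the silent hang described in this ticket.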
[jira] [Commented] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226745#comment-17226745 ] Apache Spark commented on SPARK-33362: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30265 > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33362: Assignee: Wenchen Fan (was: Apache Spark) > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33362) skipSchemaResolution should still require query to be resolved
[ https://issues.apache.org/jira/browse/SPARK-33362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33362: Assignee: Apache Spark (was: Wenchen Fan) > skipSchemaResolution should still require query to be resolved > -- > > Key: SPARK-33362 > URL: https://issues.apache.org/jira/browse/SPARK-33362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33362) skipSchemaResolution should still require query to be resolved
Wenchen Fan created SPARK-33362: --- Summary: skipSchemaResolution should still require query to be resolved Key: SPARK-33362 URL: https://issues.apache.org/jira/browse/SPARK-33362 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33360: Assignee: Wenchen Fan (was: Apache Spark) > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33360: Assignee: Apache Spark (was: Wenchen Fan) > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226739#comment-17226739 ] Apache Spark commented on SPARK-33360: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30264 > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33360) simplify DS v2 write resolution
[ https://issues.apache.org/jira/browse/SPARK-33360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226738#comment-17226738 ] Apache Spark commented on SPARK-33360: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30264 > simplify DS v2 write resolution > --- > > Key: SPARK-33360 > URL: https://issues.apache.org/jira/browse/SPARK-33360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33361) Dataset.observe() functionality does not work with structured streaming
John Wesley created SPARK-33361: --- Summary: Dataset.observe() functionality does not work with structured streaming Key: SPARK-33361 URL: https://issues.apache.org/jira/browse/SPARK-33361 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Environment: Spark on k8s, version 3.0.0 Reporter: John Wesley The Dataset observe() functionality does not work as expected with Spark in cluster mode. Related discussion here: [http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-emit-custom-metrics-to-Prometheus-in-spark-structured-streaming-td38826.html] Using lit() as the aggregation column works as expected; however, sum, count, etc. always return 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33360) simplify DS v2 write resolution
Wenchen Fan created SPARK-33360: --- Summary: simplify DS v2 write resolution Key: SPARK-33360 URL: https://issues.apache.org/jira/browse/SPARK-33360 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33359) foreachBatch sink outputs wrong metrics
John Wesley created SPARK-33359: --- Summary: foreachBatch sink outputs wrong metrics Key: SPARK-33359 URL: https://issues.apache.org/jira/browse/SPARK-33359 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The CRD is ScheduledSparkApplication Reporter: John Wesley I created 2 similar jobs: 1) First job reading from kafka and writing to console sink in append mode 2) Second job reading from kafka and writing to foreachBatch sink (which then writes in parquet format to S3). The metrics logged for the console sink show correct values for numInputRows and numOutputRows, whereas they are wrong for foreachBatch. With foreachBatch: numInputRows is one more than the actual input count, and numOutputRows is always -1. ///Console sink //20/11/05 13:36:21 INFO MicroBatchExecution: Streaming query made progress: { "id" : "775aa543-58bf-4cf7-b274-390da640b6ae", "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5", "name" : null, "timestamp" : "2020-11-05T13:36:08.921Z", "batchId" : 0, "numInputRows" : 10, "processedRowsPerSecond" : 0.7759757895553658, "durationMs" : { "addBatch" : 7735, "getBatch" : 152, "latestOffset" : 2037, "queryPlanning" : 1010, "triggerExecution" : 12886, "walCommit" : 938 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[testedr7]]", "startOffset" : null, "endOffset" : { "testedr7" : { "0" : 10 } }, "numInputRows" : 10, "processedRowsPerSecond" : 0.7759757895553658 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814", "numOutputRows" : 10 } } ///ForEachBatch Sink //20/11/05 13:43:38 INFO MicroBatchExecution: Streaming query made progress: { "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a", "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61", "name" : null, "timestamp" : "2020-11-05T13:43:15.421Z", "batchId" : 0, "numInputRows" : 11, "processedRowsPerSecond" : 0.4833252779120348, "durationMs" : { 
"addBatch" : 17689, "getBatch" : 135, "latestOffset" : 2121, "queryPlanning" : 880, "triggerExecution" : 22758, "walCommit" : 876 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[testedr7]]", "startOffset" : null, "endOffset" : { "testedr7" : { "0" : 10 } }, "numInputRows" : 11, "processedRowsPerSecond" : 0.4833252779120348 } ], "sink" : { "description" : "ForeachBatchSink", "numOutputRows" : -1 } } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates ** *{color:#de350b}incorrect{color}* {color}results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color}{color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates ** > *{color:#de350b}incorrect{color}* {color}results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query {color:#de350b}*also*{color} generates *incorrect* {color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates *incorrect* {color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query {color:#de350b}*also*{color} > generates *incorrect* {color} results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates *incorrect* {color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates ** *{color:#de350b}incorrect{color}* {color}results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates *incorrect* > {color} results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color}{color} results: * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color} results:{color} * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color} > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates > {color:#de350b}*incorrect*{color}{color} results: > * outDF = spark_session.read.csv("users.csv", inferSchema=True, > header=True).selectExpr("`User`", "cast(`FromDate` as date)") > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah updated SPARK-33327: --- Description: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates {color:#de350b}*incorrect*{color} results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query also generates {color:#de350b}*incorrect*{color} results:{color} * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color} was: The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column as a timestamp. 
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) * outDF.createOrReplaceTempView("table02") In this default case the following sql generates correct results: {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} {color:#172b4d}However if we read the dataframe like so (where the "FromDate" is read in as a Date, then the above sql query generates incorrect results:{color} * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color} > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates > {color:#de350b}*incorrect*{color} results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query also generates > {color:#de350b}*incorrect*{color} results:{color} > * {color:#172b4d}outDF = spark_session.read.csv("users.csv", > inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as > date)"){color} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226712#comment-17226712 ] Punit Shah commented on SPARK-33327: The correct behaviour of running the query should be: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-02-21, 2013-12-13, 4 or: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-02-21 00:00:00, 2013-12-13 00:00:00, 4 Thanks for asking [~hyukjin.kwon] Now I notice that both imports fail as shown below: The spark_session.read.csv("users.csv", inferSchema=True, header=True) behaves incorrectly like: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-12-13 00:00:00, 2013-03-18 00:00:00, 4 The spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") also behaves incorrectly like so: cnt, FromDate_First, FromDate_Last, cntdist 15, 2013-12-13 , 2013-02-21 , 4 > grouped by first and last against date column returns incorrect results > --- > > Key: SPARK-33327 > URL: https://issues.apache.org/jira/browse/SPARK-33327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 2.4.7 >Reporter: Punit Shah >Priority: Major > Attachments: users.csv > > > The attached csv file has two columns, namely "User" and "FromDate". The > import defaults the "FromDate" column as a timestamp. 
> * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True) > * outDF.createOrReplaceTempView("table02") > In this default case the following sql generates correct results: > {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as > `FromDate_First`, last(`FromDate`) as `FromDate_Last`, > count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color} > {color:#172b4d}However if we read the dataframe like so (where the "FromDate" > is read in as a Date, then the above sql query generates incorrect > results:{color} > * {color:#172b4d}outDF = spark_session.read.csv("users.csv", > inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as > date)"){color} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
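A likely explanation for the varying results above: Spark's first() and last() aggregates are non-deterministic unless the input ordering is fixed, so after a shuffle they depend on which partition's data is merged first. A minimal pure-Python sketch of that effect (not Spark code; `first_last` and the two-partition split are illustrative assumptions):

```python
# Rows for a single user, in file order (FromDate ascending).
dates = ["2013-02-21", "2013-03-18", "2013-06-02", "2013-12-13"]

def first_last(partitions):
    # Mimic a distributed aggregate: each partition contributes its own
    # (first, last) pair, and pairs are merged in whatever order the
    # partitions happen to arrive.
    merged = None
    for part in partitions:
        pair = (part[0], part[-1])
        merged = pair if merged is None else (merged[0], pair[1])
    return merged

# Partitions arriving in file order give the "expected" answer.
print(first_last([dates[:2], dates[2:]]))   # ('2013-02-21', '2013-12-13')

# The same data with partitions swapped gives a different answer.
print(first_last([dates[2:], dates[:2]]))   # ('2013-06-02', '2013-03-18')
```

The data is identical in both calls; only the merge order changes, which is exactly the kind of run-to-run variation the comment above reports for both the timestamp and date imports.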
[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
[ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226665#comment-17226665 ] Nicola Bova commented on SPARK-33349: - [~dongjoon] Would you consider upgrading kubernetes-client to 4.12.0? I'm not 100% sure but this issue seems to completely hang the spark application on k8s, at least this is what I think it's happening in my case. Did the spark app hang when you encountered this problem? > ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed > -- > > Key: SPARK-33349 > URL: https://issues.apache.org/jira/browse/SPARK-33349 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.0 >Reporter: Nicola Bova >Priority: Critical > > I launch my spark application with the > [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] > with the following yaml file: > {code:yaml} > apiVersion: sparkoperator.k8s.io/v1beta2 > kind: SparkApplication > metadata: > name: spark-kafka-streamer-test > namespace: kafka2hdfs > spec: > type: Scala > mode: cluster > image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0 > imagePullPolicy: Always > timeToLiveSeconds: 259200 > mainClass: path.to.my.class.KafkaStreamer > mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar > sparkVersion: 3.0.1 > restartPolicy: > type: Always > sparkConf: > "spark.kafka.consumer.cache.capacity": "8192" > "spark.kubernetes.memoryOverheadFactor": "0.3" > deps: > jars: > - my > - jar > - list > hadoopConfigMap: hdfs-config > driver: > cores: 4 > memory: 12g > labels: > version: 3.0.1 > serviceAccount: default > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > executor: > instances: 4 > cores: 4 > memory: 16g > labels: > version: 3.0.1 > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > {code} > I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the 
["Restart > the watcher when we receive a version changed from > k8s"|https://github.com/apache/spark/pull/29533] patch. > This is the driver log: > {code} > 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > ... // my app log, it's a structured streaming app reading from kafka and > writing to hdfs > 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed (this is expected if the application is shutting down.) > io.fabric8.kubernetes.client.KubernetesClientException: too old resource > version: 1574101276 (1574213896) > at > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) > at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) > at > okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) > at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > {code} > The error above appears after roughly 50 minutes. > After the exception above, no more logs are produced and the app hangs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
[ https://issues.apache.org/jira/browse/SPARK-33358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33358: Assignee: (was: Apache Spark) > Spark SQL CLI command processing loop can't exit when one command fails > -- > > Key: SPARK-33358 > URL: https://issues.apache.org/jira/browse/SPARK-33358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lichuanliang >Priority: Major > > > When submitting a multi-statement SQL script through bin/spark-sql, if one of > the earlier commands fails, the processing loop does not exit; it keeps > executing the following statements and ultimately reports the whole program > as successful. > > {code:java} > for (oneCmd <- commands) { > if (StringUtils.endsWith(oneCmd, "\\")) { > command += StringUtils.chop(oneCmd) + ";" > } else { > command += oneCmd > if (!StringUtils.isBlank(command)) { > val ret = processCmd(command) > command = "" > lastRet = ret > val ignoreErrors = HiveConf.getBoolVar(conf, > HiveConf.ConfVars.CLIIGNOREERRORS) > if (ret != 0 && !ignoreErrors) { > CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) > ret // for loop will not return if one of the commands fail > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
[ https://issues.apache.org/jira/browse/SPARK-33358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33358: Assignee: Apache Spark > Spark SQL CLI command processing loop can't exit when one command fails > -- > > Key: SPARK-33358 > URL: https://issues.apache.org/jira/browse/SPARK-33358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lichuanliang >Assignee: Apache Spark >Priority: Major > > > When submitting a multi-statement SQL script through bin/spark-sql, if one of > the earlier commands fails, the processing loop does not exit; it keeps > executing the following statements and ultimately reports the whole program > as successful. > > {code:java} > for (oneCmd <- commands) { > if (StringUtils.endsWith(oneCmd, "\\")) { > command += StringUtils.chop(oneCmd) + ";" > } else { > command += oneCmd > if (!StringUtils.isBlank(command)) { > val ret = processCmd(command) > command = "" > lastRet = ret > val ignoreErrors = HiveConf.getBoolVar(conf, > HiveConf.ConfVars.CLIIGNOREERRORS) > if (ret != 0 && !ignoreErrors) { > CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) > ret // for loop will not return if one of the commands fail > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
[ https://issues.apache.org/jira/browse/SPARK-33358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226629#comment-17226629 ] Apache Spark commented on SPARK-33358: -- User 'artiship' has created a pull request for this issue: https://github.com/apache/spark/pull/30263 > Spark SQL CLI command processing loop can't exit when one command fails > -- > > Key: SPARK-33358 > URL: https://issues.apache.org/jira/browse/SPARK-33358 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lichuanliang >Priority: Major > > > When submitting a multi-statement SQL script through bin/spark-sql, if one of > the earlier commands fails, the processing loop does not exit; it keeps > executing the following statements and ultimately reports the whole program > as successful. > > {code:java} > for (oneCmd <- commands) { > if (StringUtils.endsWith(oneCmd, "\\")) { > command += StringUtils.chop(oneCmd) + ";" > } else { > command += oneCmd > if (!StringUtils.isBlank(command)) { > val ret = processCmd(command) > command = "" > lastRet = ret > val ignoreErrors = HiveConf.getBoolVar(conf, > HiveConf.ConfVars.CLIIGNOREERRORS) > if (ret != 0 && !ignoreErrors) { > CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) > ret // for loop will not return if one of the commands fail > } > } > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33358) Spark SQL CLI command processing loop can't exit when one command fails
Lichuanliang created SPARK-33358: Summary: Spark SQL CLI command processing loop can't exit when one command fails Key: SPARK-33358 URL: https://issues.apache.org/jira/browse/SPARK-33358 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1 Reporter: Lichuanliang When submitting a multi-statement SQL script through bin/spark-sql, if one of the earlier commands fails, the processing loop does not exit; it continues executing the following statements and finally reports the whole program as successful. {code:scala} for (oneCmd <- commands) { if (StringUtils.endsWith(oneCmd, "\\")) { command += StringUtils.chop(oneCmd) + ";" } else { command += oneCmd if (!StringUtils.isBlank(command)) { val ret = processCmd(command) command = "" lastRet = ret val ignoreErrors = HiveConf.getBoolVar(conf, HiveConf.ConfVars.CLIIGNOREERRORS) if (ret != 0 && !ignoreErrors) { CommandProcessorFactory.clean(conf.asInstanceOf[HiveConf]) ret // bare expression: the for loop will not return if one of the commands fails } } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
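The control-flow bug quoted above is that `ret` on its own line is just a value expression; the Scala for loop discards it and keeps iterating. A minimal sketch of the intended behavior (plain Python, not Spark's actual CLI driver; `process_cmd` is a hypothetical callback):

```python
# Sketch (plain Python, not Spark's CLI driver) of the intended control flow:
# stop at the first failing command unless errors are ignored. In the quoted
# Scala, the bare `ret` expression is discarded and the loop keeps going.

def process_commands(commands, process_cmd, ignore_errors=False):
    """Run commands in order; return the first nonzero exit code, else the
    exit code of the last command."""
    last_ret = 0
    for cmd in commands:
        last_ret = process_cmd(cmd)
        if last_ret != 0 and not ignore_errors:
            return last_ret  # actually leave the loop, unlike the bare `ret`
    return last_ret
```

With commands `["a", "bad", "c"]` where only `"bad"` fails, the sketch returns the failure code and never runs `"c"` unless `ignore_errors=True`.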
[jira] [Assigned] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33356: Assignee: (was: Apache Spark) > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. >Reporter: Lucas Brutschy >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The underlying cause appears to be mutual recursion between > {{PartitionerAwareUnion.getPreferredLocations}} and > {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each > {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is > visited {{O(n!)}} (exponentially many) times. > Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of > {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to > demonstrate the issue in a sufficiently small example. Given a large DAG and > many PartitionerAwareUnions, especially constructed by iterative algorithms, > the problem can become relevant even without "abuse" of the union operation. > The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, > but in the special case of PartitionerAwareUnion, it is still possible. This > may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226611#comment-17226611 ] Apache Spark commented on SPARK-33356: -- User 'lucasbru' has created a pull request for this issue: https://github.com/apache/spark/pull/30262 > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. >Reporter: Lucas Brutschy >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The underlying cause appears to be mutual recursion between > {{PartitionerAwareUnion.getPreferredLocations}} and > {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each > {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is > visited {{O(n!)}} (exponentially many) times. > Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of > {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to > demonstrate the issue in a sufficiently small example. Given a large DAG and > many PartitionerAwareUnions, especially constructed by iterative algorithms, > the problem can become relevant even without "abuse" of the union operation. > The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, > but in the special case of PartitionerAwareUnion, it is still possible. This > may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33356: Assignee: Apache Spark > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. >Reporter: Lucas Brutschy >Assignee: Apache Spark >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The underlying cause appears to be mutual recursion between > {{PartitionerAwareUnion.getPreferredLocations}} and > {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each > {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is > visited {{O(n!)}} (exponentially many) times. > Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of > {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to > demonstrate the issue in a sufficiently small example. Given a large DAG and > many PartitionerAwareUnions, especially constructed by iterative algorithms, > the problem can become relevant even without "abuse" of the union operation. > The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, > but in the special case of PartitionerAwareUnion, it is still possible. This > may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33357) Support SparkLauncher in Kubernetes
hong dongdong created SPARK-33357: - Summary: Support SparkLauncher in Kubernetes Key: SPARK-33357 URL: https://issues.apache.org/jira/browse/SPARK-33357 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.0.1 Reporter: hong dongdong Currently, SparkAppHandle cannot get state reports in Kubernetes (k8s); we can add support for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
[ https://issues.apache.org/jira/browse/SPARK-33356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Brutschy updated SPARK-33356: --- Description: The current implementation of the {{DAGScheduler}} exhibits exponential runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}. A minimal example reproducing the issue: {code:scala} object Example extends App { val partitioner = new HashPartitioner(2) val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]")) val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) val rdd2 = (1 to 30).map(_ => rdd1) val rdd3 = rdd2.reduce(_ union _) rdd3.collect() } {code} The whole app should take around one second to complete, as no actual work is done. However, it takes more time to submit the job than I am willing to wait. The underlying cause appears to be mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is visited {{O(n!)}} (exponentially many) times. Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to demonstrate the issue in a sufficiently small example. Given a large DAG and many PartitionerAwareUnions, especially constructed by iterative algorithms, the problem can become relevant even without "abuse" of the union operation. The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, but in the special case of PartitionerAwareUnion, it is still possible. This may actually be an underlying cause of SPARK-29181. was: The current implementation of the {{DAGScheduler}} exhibits exponential runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}. A minimal example reproducing the issue: {code:java} object Example extends App { val partitioner = new HashPartitioner(2) val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]")) val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) val rdd2 = (1 to 30).map(_ => rdd1) val rdd3 = rdd2.reduce(_ union _) rdd3.collect() } {code} The whole app should take around one second to complete, as no actual work is done. However, it takes more time to submit the job than I am willing to wait. The underlying cause appears to be mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is visited {{O(n!)}} (exponentially many) times. Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to demonstrate the issue in a sufficiently small example. Given a large DAG and many PartitionerAwareUnions, especially constructed by iterative algorithms, the problem can become relevant even without "abuse" of the union operation. The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, but in the special case of PartitionerAwareUnion, it is still possible. This may actually be an underlying cause of SPARK-29181. > DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion > - > > Key: SPARK-33356 > URL: https://issues.apache.org/jira/browse/SPARK-33356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2, 3.0.1 > Environment: Reproducible locally with 3.0.1, 2.4.2, and latest > master. 
>Reporter: Lucas Brutschy >Priority: Minor > > The current implementation of the {{DAGScheduler}} exhibits exponential > runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be > a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} > and {{DAGScheduler.getPreferredLocs}}. > A minimal example reproducing the issue: > {code:scala} > object Example extends App { > val partitioner = new HashPartitioner(2) > val sc = new SparkContext(new > SparkConf().setAppName("").setMaster("local[*]")) > val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) > val rdd2 = (1 to 30).map(_ => rdd1) > val rdd3 = rdd2.reduce(_ union _) > rdd3.collect() > } > {code} > The whole app should take around one second to complete, as no actual work is > done. However, it takes more time to submit the job than I am willing to wait. > The und
[jira] [Created] (SPARK-33356) DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion
Lucas Brutschy created SPARK-33356: -- Summary: DAG Scheduler exhibits exponential runtime with PartitionerAwareUnion Key: SPARK-33356 URL: https://issues.apache.org/jira/browse/SPARK-33356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1, 2.4.2 Environment: Reproducible locally with 3.0.1, 2.4.2, and latest master. Reporter: Lucas Brutschy The current implementation of the {{DAGScheduler}} exhibits exponential runtime in DAGs with many {{PartitionerAwareUnions}}. The reason seems to be a mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}. A minimal example reproducing the issue: {code:scala} object Example extends App { val partitioner = new HashPartitioner(2) val sc = new SparkContext(new SparkConf().setAppName("").setMaster("local[*]")) val rdd1 = sc.emptyRDD[(Int, Int)].partitionBy(partitioner) val rdd2 = (1 to 30).map(_ => rdd1) val rdd3 = rdd2.reduce(_ union _) rdd3.collect() } {code} The whole app should take around one second to complete, as no actual work is done. However, it takes more time to submit the job than I am willing to wait. The underlying cause appears to be mutual recursion between {{PartitionerAwareUnion.getPreferredLocations}} and {{DAGScheduler.getPreferredLocs}}, which restarts graph traversal at each {{PartitionerAwareUnion}} with no memoization. Each node of the DAG is visited {{O(n!)}} (exponentially many) times. Note that it is clear to me that you could use {{sc.union(rdd2)}} instead of {{rdd2.reduce(_ union _)}} to eliminate the problem. I use this just to demonstrate the issue in a sufficiently small example. Given a large DAG and many PartitionerAwareUnions, especially constructed by iterative algorithms, the problem can become relevant even without "abuse" of the union operation. The exponential recursion in the DAG Scheduler was largely fixed with SPARK-682, but in the special case of PartitionerAwareUnion, it is still possible. 
This may actually be an underlying cause of SPARK-29181. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
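The blow-up described above can be modeled outside Spark: if every preferred-location lookup re-traverses the parent subgraph with no memoization, shared ancestry is revisited exponentially often, while caching results makes each node cost constant after its first visit. A toy sketch, not Spark's actual DAGScheduler:

```python
# Toy model (not Spark's DAGScheduler) of the exponential traversal: each
# call re-walks both parent branches from scratch, so a chain of N nodes
# triggers 2**(N+1) - 1 visits; memoizing reduces that to N + 1 visits.
from functools import lru_cache

N = 18
naive_visits = 0
memo_visits = 0

def naive_locs(node):
    global naive_visits
    naive_visits += 1                 # every call walks the subgraph again
    if node == 0:
        return frozenset({"host0"})
    return naive_locs(node - 1) | naive_locs(node - 1)

@lru_cache(maxsize=None)
def memo_locs(node):
    global memo_visits
    memo_visits += 1                  # body runs once per distinct node
    if node == 0:
        return frozenset({"host0"})
    return memo_locs(node - 1) | memo_locs(node - 1)

naive_locs(N)
memo_locs(N)
```

With N = 18 the naive traversal makes 2^19 - 1 = 524287 visits while the memoized one makes 19, which mirrors why the 30-union reproduction above never finishes submitting.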
[jira] [Resolved] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-30294. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26935 [https://github.com/apache/spark/pull/26935] > Read-only state store unnecessarily creates and deletes the temp file for > delta file every batch > > > Key: SPARK-30294 > URL: https://issues.apache.org/jira/browse/SPARK-30294 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.1.0 > > > [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155] > {code:scala} > /** Abort all the updates made on this store. This store will not be > usable any more. */ > override def abort(): Unit = { > // This if statement is to ensure that files are deleted only if there > are changes to the > // StateStore. We have two StateStores for each task, one which is used > only for reading, and > // the other used for read+write. We don't want the read-only to delete > state files. > if (state == UPDATING) { > state = ABORTED > cancelDeltaFile(compressedStream, deltaFileStream) > } else { > state = ABORTED > } > logInfo(s"Aborted version $newVersion for $this") > } {code} > Despite the comment, the read-only state store also does the same things to > prepare for writing: it creates the temporary file, initializes output streams for > the file, closes these output streams, and deletes the temporary file. That > is just unnecessary and causes confusion, as according to the log messages two > different instances seem to write to the same delta file. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-30294: Assignee: Jungtaek Lim > Read-only state store unnecessarily creates and deletes the temp file for > delta file every batch > > > Key: SPARK-30294 > URL: https://issues.apache.org/jira/browse/SPARK-30294 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > > [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155] > {code:scala} > /** Abort all the updates made on this store. This store will not be > usable any more. */ > override def abort(): Unit = { > // This if statement is to ensure that files are deleted only if there > are changes to the > // StateStore. We have two StateStores for each task, one which is used > only for reading, and > // the other used for read+write. We don't want the read-only to delete > state files. > if (state == UPDATING) { > state = ABORTED > cancelDeltaFile(compressedStream, deltaFileStream) > } else { > state = ABORTED > } > logInfo(s"Aborted version $newVersion for $this") > } {code} > Despite the comment, the read-only state store also does the same things to > prepare for writing: it creates the temporary file, initializes output streams for > the file, closes these output streams, and deletes the temporary file. That > is just unnecessary and causes confusion, as according to the log messages two > different instances seem to write to the same delta file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
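One way to avoid the cost described above is to create the temporary delta file lazily, so a store that only ever reads never touches the filesystem and its abort() has nothing to delete. A hypothetical sketch in Python (the real code lives in HDFSBackedStateStoreProvider.scala; all names here are illustrative, not Spark's API):

```python
# Hypothetical sketch (not the real HDFSBackedStateStoreProvider) of a
# delta-file lifecycle where the temp file is created lazily: a store used
# only for reads never creates one, so abort() has nothing to clean up.

class StateStoreSketch:
    def __init__(self):
        self.state = "UPDATING"
        self.temp_file_created = False
        self.deleted_temp_files = 0

    def put(self, key, value):
        if not self.temp_file_created:
            # first write: this is where the temp delta file would be opened
            self.temp_file_created = True

    def abort(self):
        if self.temp_file_created:
            # cancel the delta file only if one was actually created
            self.deleted_temp_files += 1
            self.temp_file_created = False
        self.state = "ABORTED"

read_only = StateStoreSketch()   # used only for reads: no put() calls
read_only.abort()                # no temp file ever existed, nothing deleted

writer = StateStoreSketch()
writer.put("k", 1)               # temp file created on first write
writer.abort()                   # temp file cancelled exactly once
```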
[jira] [Commented] (SPARK-33336) provided an invalid fromOffset when spark streaming running
[ https://issues.apache.org/jira/browse/SPARK-33336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226592#comment-17226592 ] Yu Wang commented on SPARK-33336: - [~hyukjin.kwon] thanks for your answer; someone has tried a higher version of Spark: https://issues.apache.org/jira/browse/KAFKA-8124 > provided an invalid fromOffset when spark streaming running > --- > > Key: SPARK-33336 > URL: https://issues.apache.org/jira/browse/SPARK-33336 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.2 >Reporter: Yu Wang >Priority: Major > Attachments: image-2020-11-04-17-24-23-782.png, > image-2020-11-04-17-24-42-544.png > > > !image-2020-11-04-17-24-23-782.png|width=923,height=386! > > > !image-2020-11-04-17-24-42-544.png|width=1199,height=260! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33355) Batch fix compilation warnings about "method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated (since 2.13.0)"
Yang Jie created SPARK-33355: Summary: Batch fix compilation warnings about "method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated (since 2.13.0)" Key: SPARK-33355 URL: https://issues.apache.org/jira/browse/SPARK-33355 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.1.0 Reporter: Yang Jie The compilation warnings are as follows: {code:java} [WARNING] [Warn] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:1711: method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 is deprecated (since 2.13.0): Implicit conversions from Array to immutable.IndexedSeq are implemented by copying; Use the more efficient non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call {code} This Jira is a placeholder for now; wait until Scala 2.12 is no longer supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32934) Improve the performance for NTH_VALUE and Refactor the OffsetWindowFunction
[ https://issues.apache.org/jira/browse/SPARK-32934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226587#comment-17226587 ] Apache Spark commented on SPARK-32934: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30261 > Improve the performance for NTH_VALUE and Refactor the OffsetWindowFunction > -- > > Key: SPARK-32934 > URL: https://issues.apache.org/jira/browse/SPARK-32934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > Spark SQL supports some window functions like NTH_VALUE. > If we specify a window frame like > {code:java} > UNBOUNDED PRECEDING AND CURRENT ROW > {code} > or > {code:java} > UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > we can eliminate some calculations. > For example, if we execute the SQL shown below: > {code:sql} > SELECT NTH_VALUE(col, 2) OVER (ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING > AND CURRENT ROW) > FROM tab; > {code} > then for row numbers greater than 1 the output is a fixed value, and otherwise > it is null. So we can calculate the value once and only check whether the row > number is less than 2. > The UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case is even simpler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
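The optimization can be illustrated outside Spark: with a frame of UNBOUNDED PRECEDING AND CURRENT ROW, the frame only grows, so NTH_VALUE(col, n) is null for the first n-1 rows and a constant afterwards. A plain-Python sketch (function names are illustrative, not Spark internals):

```python
# Plain-Python illustration (names are not Spark internals) of NTH_VALUE over
# a growing frame: the naive version rescans the frame for every row, while
# the optimized version computes the nth value exactly once.

def nth_value_running(rows, n):
    """Naive: rescan the growing frame per row, O(len(rows)**2)."""
    out = []
    for i in range(len(rows)):
        frame = rows[: i + 1]
        out.append(frame[n - 1] if len(frame) >= n else None)
    return out

def nth_value_optimized(rows, n):
    """Optimized: null until the frame holds n rows, then a constant."""
    fixed = rows[n - 1] if len(rows) >= n else None
    return [None if i + 1 < n else fixed for i in range(len(rows))]
```

For rows [10, 20, 30, 40] and n = 2, both versions return [None, 20, 20, 20]; the optimized one reads the input a single time.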
[jira] [Updated] (SPARK-33339) PySpark application will hang due to a non-Exception error
[ https://issues.apache.org/jira/browse/SPARK-33339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lrz updated SPARK-33339: Description: When a SystemExit exception occurs during processing, the Python worker exits abnormally, and then the executor task is still waiting to read from the worker's socket, causing it to hang. The SystemExit exception may be caused by an error in the user's code, but Spark should at least throw an error to remind the user rather than get stuck. We can run a simple test to reproduce this case: ``` from pyspark.sql import SparkSession def err(line): raise SystemExit spark = SparkSession.builder.appName("test").getOrCreate() spark.sparkContext.parallelize(range(1,2), 2).map(err).collect() spark.stop() ``` was: In a PySpark application the worker doesn't catch BaseException, so once the worker calls SystemExit because of some error, the application hangs without throwing any exception, and the cause of the failure is not easy to find. For example, running `spark-submit --master yarn-client test.py` will hang without any information. The test.py content: ``` from pyspark.sql import SparkSession def err(line): raise SystemExit spark = SparkSession.builder.appName("test").getOrCreate() spark.sparkContext.parallelize(range(1,2), 2).map(err).collect() spark.stop() ``` Summary: PySpark application will hang due to a non-Exception error (was: pyspark application maybe hangup because of worker exit) > PySpark application will hang due to a non-Exception error > -- > > Key: SPARK-33339 > URL: https://issues.apache.org/jira/browse/SPARK-33339 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.0.1 >Reporter: lrz >Priority: Major > > When a SystemExit exception occurs during processing, the Python worker > exits abnormally, and then the executor task is still waiting to read from the > worker's socket, causing it to hang. 
> The SystemExit exception may be caused by an error in the user's code, but Spark > should at least throw an error to remind the user rather than get stuck. > We can run a simple test to reproduce this case: > ``` > from pyspark.sql import SparkSession > def err(line): > raise SystemExit > spark = SparkSession.builder.appName("test").getOrCreate() > spark.sparkContext.parallelize(range(1,2), 2).map(err).collect() > spark.stop() > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
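The reason SystemExit escapes unnoticed is that it subclasses BaseException rather than Exception, so a worker-side `except Exception` handler never sees it. A small sketch of the difference (illustrative only, not the actual pyspark/worker.py code):

```python
# SystemExit subclasses BaseException, not Exception, so a handler written as
# `except Exception` never sees it and the worker process simply exits.
# Catching BaseException lets the worker report the failure back instead of
# dying silently. (Sketch only; not the actual pyspark/worker.py code.)

assert not issubclass(SystemExit, Exception)
assert issubclass(SystemExit, BaseException)

def run_udf_narrow(fn, value):
    try:
        return ("ok", fn(value))
    except Exception as exc:        # SystemExit slips through this handler
        return ("error", repr(exc))

def run_udf_wide(fn, value):
    try:
        return ("ok", fn(value))
    except BaseException as exc:    # also catches SystemExit
        return ("error", repr(exc))

def bad(_):
    raise SystemExit
```

`run_udf_wide(bad, 1)` returns an error tuple that the worker could report back to the executor; `run_udf_narrow(bad, 1)` lets SystemExit propagate and kill the worker, which is the hang described above.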