[jira] [Commented] (SPARK-29006) Support special date/timestamp values `infinity`/`-infinity`
[ https://issues.apache.org/jira/browse/SPARK-29006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926372#comment-16926372 ] Anurag Sharma commented on SPARK-29006: --- [~maxgekk] Thanks, will wait for your code to be merged. > Support special date/timestamp values `infinity`/`-infinity` > > > Key: SPARK-29006 > URL: https://issues.apache.org/jira/browse/SPARK-29006 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > ||Input String||Valid Types||Description|| > |{{infinity}}|{{date}}, {{timestamp}}|later than all other time stamps| > |{{-infinity}}|{{date}}, {{timestamp}}|earlier than all other time stamps| > https://www.postgresql.org/docs/12/datatype-datetime.html -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
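For readers unfamiliar with the PostgreSQL feature referenced above: the strings {{infinity}}/{{-infinity}} parse into sentinel values that sort after/before every other date or timestamp. A minimal, illustrative sketch of that parsing idea in Scala (this is not the pending Spark implementation; the helper name and the use of LocalDate.MAX/MIN as sentinels are assumptions):

{code:scala}
import java.time.LocalDate

// Illustrative only: map the special strings to sentinel dates that sort
// after/before all ordinary dates, and fall back to normal parsing otherwise.
def parseDateWithSpecials(s: String): LocalDate = s.trim.toLowerCase match {
  case "infinity"  => LocalDate.MAX  // assumed sentinel: later than all other dates
  case "-infinity" => LocalDate.MIN  // assumed sentinel: earlier than all other dates
  case other       => LocalDate.parse(other)
}

// parseDateWithSpecials("infinity").isAfter(LocalDate.parse("9999-12-31"))  // true
{code}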
[jira] [Updated] (SPARK-29024) Ignore case while resolving time zones
[ https://issues.apache.org/jira/browse/SPARK-29024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29024: --- Summary: Ignore case while resolving time zones (was: Support the `zulu` time zone) > Ignore case while resolving time zones > -- > > Key: SPARK-29024 > URL: https://issues.apache.org/jira/browse/SPARK-29024 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The `zulu` time zone is checked by > https://github.com/apache/spark/blob/67b4329fb08fd606461aa1ac9274c4a84d15d70e/sql/core/src/test/resources/sql-tests/inputs/pgSQL/timestamp.sql#L31 > but `getZoneId` fails on resolving it: > {code} > scala> getZoneId("zulu") > java.time.zone.ZoneRulesException: Unknown time-zone ID: zulu > at java.time.zone.ZoneRulesProvider.getProvider(ZoneRulesProvider.java:272) > at java.time.zone.ZoneRulesProvider.getRules(ZoneRulesProvider.java:227) > at java.time.ZoneRegion.ofId(ZoneRegion.java:120) > at java.time.ZoneId.of(ZoneId.java:411) > at java.time.ZoneId.of(ZoneId.java:359) > at java.time.ZoneId.of(ZoneId.java:315) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:77) > ... 49 elided > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
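One straightforward way to make resolution case-insensitive is to match the requested ID against ZoneId.getAvailableZoneIds ignoring case before calling ZoneId.of; a small sketch of that idea (illustrative only, not necessarily how the Spark patch implements it):

{code:scala}
import java.time.ZoneId
import scala.collection.JavaConverters._

// Sketch: resolve "zulu" like "Zulu" by matching region IDs case-insensitively,
// then delegating to the normal ZoneId.of lookup.
def getZoneIdIgnoreCase(id: String): ZoneId = {
  val canonical = ZoneId.getAvailableZoneIds.asScala
    .find(_.equalsIgnoreCase(id))
    .getOrElse(id)
  ZoneId.of(canonical)
}

// getZoneIdIgnoreCase("zulu")  // resolves to ZoneId "Zulu" instead of throwing ZoneRulesException
{code}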
[jira] [Created] (SPARK-29032) Simplify Prometheus support by adding `PrometheusServlet`
Dongjoon Hyun created SPARK-29032: - Summary: Simplify Prometheus support by adding `PrometheusServlet` Key: SPARK-29032 URL: https://issues.apache.org/jira/browse/SPARK-29032 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue aims to simplify `Prometheus` support. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29032: -- Summary: Simplify Prometheus support by adding PrometheusServlet (was: Simplify Prometheus support by adding PrometheusServlet`) > Simplify Prometheus support by adding PrometheusServlet > --- > > Key: SPARK-29032 > URL: https://issues.apache.org/jira/browse/SPARK-29032 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to simplify `Prometheus` support. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet`
[ https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29032: -- Summary: Simplify Prometheus support by adding PrometheusServlet` (was: Simplify Prometheus support by adding `PrometheusServlet`) > Simplify Prometheus support by adding PrometheusServlet` > > > Key: SPARK-29032 > URL: https://issues.apache.org/jira/browse/SPARK-29032 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to simplify `Prometheus` support. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29032: -- Description: This issue aims to simplify `Prometheus` support in Spark standalone environment or K8s environment. (was: This issue aims to simplify `Prometheus` support.) > Simplify Prometheus support by adding PrometheusServlet > --- > > Key: SPARK-29032 > URL: https://issues.apache.org/jira/browse/SPARK-29032 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to simplify `Prometheus` support in Spark standalone > environment or K8s environment. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29033) Always use CreateNamedStructUnsafe codepath
Josh Rosen created SPARK-29033: -- Summary: Always use CreateNamedStructUnsafe codepath Key: SPARK-29033 URL: https://issues.apache.org/jira/browse/SPARK-29033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen Assignee: Josh Rosen Spark 2.x has two separate implementations of the "create named struct" expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. The "unsafe" version was added in SPARK-9373 to support structs in {{GenerateUnsafeProjection}}. These two expressions both extend the {{CreateNameStructLike}} trait. For Spark 3.0, I propose to always use the "unsafe" code path: this will avoid object allocation / boxing inefficiencies in the "safe" codepath, which is an especially big problem when generating Encoders for deeply-nested structs. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29033) Always use CreateNamedStructUnsafe codepath
[ https://issues.apache.org/jira/browse/SPARK-29033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-29033: --- Issue Type: Improvement (was: Bug) > Always use CreateNamedStructUnsafe codepath > --- > > Key: SPARK-29033 > URL: https://issues.apache.org/jira/browse/SPARK-29033 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 2.x has two separate implementations of the "create named struct" > expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. > The "unsafe" version was added in SPARK-9373 to support structs in > {{GenerateUnsafeProjection}}. These two expressions both extend the > {{CreateNameStructLike}} trait. > For Spark 3.0, I propose to always use the "unsafe" code path: this will > avoid object allocation / boxing inefficiencies in the "safe" codepath, which > is an especially big problem when generating Encoders for deeply-nested > structs. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29033) Always use CreateNamedStructUnsafe codepath
[ https://issues.apache.org/jira/browse/SPARK-29033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-29033: --- Description: Spark 2.x has two separate implementations of the "create named struct" expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. The "unsafe" version was added in SPARK-9373 to support structs in {{GenerateUnsafeProjection}}. These two expressions both extend the {{CreateNameStructLike}} trait. For Spark 3.0, I propose to always use the "unsafe" code path: this will avoid object allocation / boxing inefficiencies in the "safe" codepath, which is an especially big problem when generating Encoders for deeply-nested case classes. was: Spark 2.x has two separate implementations of the "create named struct" expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. The "unsafe" version was added in SPARK-9373 to support structs in {{GenerateUnsafeProjection}}. These two expressions both extend the {{CreateNameStructLike}} trait. For Spark 3.0, I propose to always use the "unsafe" code path: this will avoid object allocation / boxing inefficiencies in the "safe" codepath, which is an especially big problem when generating Encoders for deeply-nested structs. > Always use CreateNamedStructUnsafe codepath > --- > > Key: SPARK-29033 > URL: https://issues.apache.org/jira/browse/SPARK-29033 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 2.x has two separate implementations of the "create named struct" > expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. > The "unsafe" version was added in SPARK-9373 to support structs in > {{GenerateUnsafeProjection}}. These two expressions both extend the > {{CreateNameStructLike}} trait. > For Spark 3.0, I propose to always use the "unsafe" code path: this will > avoid object allocation / boxing inefficiencies in the "safe" codepath, which > is an especially big problem when generating Encoders for deeply-nested case > classes. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
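For context on why the safe/unsafe split matters: the struct-creation expressions are generated whenever an Encoder serializes a (possibly nested) case class, so deep nesting multiplies the per-field boxing cost of the safe path. A small sketch of the kind of workload the description refers to (illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession

// Each nesting level in the case class becomes a nested struct in the
// serializer, which is where CreateNamedStruct / CreateNamedStructUnsafe
// are used under the hood.
case class Inner(a: Int, b: Long)
case class Middle(left: Inner, right: Inner)
case class Outer(id: Long, payload: Middle)

val spark = SparkSession.builder().master("local[*]").appName("nested-encoder").getOrCreate()
import spark.implicits._

val ds = Seq(Outer(1L, Middle(Inner(1, 2L), Inner(3, 4L)))).toDS()
ds.printSchema()  // shows the nested struct schema built by the encoder
{code}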
[jira] [Created] (SPARK-29034) String Constants with C-style Escapes
Yuming Wang created SPARK-29034: --- Summary: String Constants with C-style Escapes Key: SPARK-29034 URL: https://issues.apache.org/jira/browse/SPARK-29034 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang PostgreSQL also accepts "escape" string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter {{E}} (upper or lower case) just before the opening single quote, e.g., {{E'foo'}}. (When continuing an escape string constant across lines, write {{E}} only before the first opening quote.) Within an escape string, a backslash character ({{\}}) begins a C-like _backslash escape_ sequence, in which the combination of backslash and following character(s) represent a special byte value, as shown in [Table 4-1|https://www.postgresql.org/docs/9.3/sql-syntax-lexical.html#SQL-BACKSLASH-TABLE]. *Table 4-1. Backslash Escape Sequences* ||Backslash Escape Sequence||Interpretation|| |{{\b}}|backspace| |{{\f}}|form feed| |{{\n}}|newline| |{{\r}}|carriage return| |{{\t}}|tab| |{{\}}{{o}}, {{\}}{{oo}}, {{\}}{{ooo}} ({{o}} = 0 - 7)|octal byte value| |{{\x}}{{h}}, {{\x}}{{hh}} ({{h}} = 0 - 9, A - F)|hexadecimal byte value| |{{\u}}{{}}, {{\U}}{{}} ({{x}} = 0 - 9, A - F)|16 or 32-bit hexadecimal Unicode character value| Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes ({{\\}}). Also, a single quote can be included in an escape string by writing {{\'}}, in addition to the normal way of {{''}}. It is your responsibility that the byte sequences you create, especially when using the octal or hexadecimal escapes, compose valid characters in the server character set encoding. When the server encoding is UTF-8, then the Unicode escapes or the alternative Unicode escape syntax, explained in [Section 4.1.2.3|https://www.postgresql.org/docs/9.3/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE], should be used instead. (The alternative would be doing the UTF-8 encoding by hand and writing out the bytes, which would be very cumbersome.) The Unicode escape syntax works fully only when the server encoding is {{UTF8}}. When other server encodings are used, only code points in the ASCII range (up to {{\u007F}}) can be specified. Both the 4-digit and the 8-digit form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+, although the availability of the 8-digit form technically makes this unnecessary. (When surrogate pairs are used when the server encoding is {{UTF8}}, they are first combined into a single code point that is then encoded in UTF-8.) [https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-BACKSLASH-TABLE] Example: {code:sql} postgres=# SET bytea_output TO escape; SET postgres=# SELECT E'Th\\000omas'::bytea; bytea Th\000omas (1 row) postgres=# SELECT 'Th\\000omas'::bytea; bytea - Th\\000omas (1 row) {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
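To make the semantics above concrete, here is a tiny illustration in Scala of decoding just the single-character escapes from the table (\b, \f, \n, \r, \t, plus the "any other character is taken literally" rule); the octal, hexadecimal, and Unicode forms are left out, and this is not a parser Spark ships:

{code:scala}
// Illustration only: decode the single-character backslash escapes described
// above. Any other character after a backslash is kept literally, which also
// covers \\ -> \ and \' -> '.
def unescapeCStyle(s: String): String = {
  val out = new StringBuilder
  var i = 0
  while (i < s.length) {
    if (s.charAt(i) == '\\' && i + 1 < s.length) {
      out.append(s.charAt(i + 1) match {
        case 'b' => '\b'
        case 'f' => '\f'
        case 'n' => '\n'
        case 'r' => '\r'
        case 't' => '\t'
        case c   => c
      })
      i += 2
    } else {
      out.append(s.charAt(i))
      i += 1
    }
  }
  out.toString
}

// unescapeCStyle("""a\tb\nc""")  // "a", tab, "b", newline, "c"
{code}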
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926420#comment-16926420 ] Gabor Somogyi commented on SPARK-29027: --- [~kabhwan] thanks for pinging. I know of this because I've suggested on the original PR to open this jira. Apart from jenkins runs (which are passing) yesterday I've started this test in a loop with sbt and maven as well but until now haven't failed. What I can think of: * The environment is significantly different from my MAC and from PR builder * The code is not vanilla Spark and has some downstream changes All in all as suggested exact environment description + debug logs would help. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary fo
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926423#comment-16926423 ] Gabor Somogyi commented on SPARK-29027: --- [~koert] are you guys using vanilla Spark or the code contains some downstream changes? > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 24.586 > s] > [INFO] Spark Project Local DB . SUCCESS [ 5.456 > s] > [INFO] Spark
[jira] [Updated] (SPARK-26598) Fix HiveThriftServer2 setting hiveconf and hivevar on every SQL statement
[ https://issues.apache.org/jira/browse/SPARK-26598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-26598: Issue Type: Bug (was: Improvement) > Fix HiveThriftServer2 setting hiveconf and hivevar on every SQL statement > --- > > Key: SPARK-26598 > URL: https://issues.apache.org/jira/browse/SPARK-26598 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wangtao93 >Assignee: dzcxzl >Priority: Major > Fix For: 3.0.0 > > > [https://github.com/apache/spark/pull/17886] added support for --hiveconf and --hivevar to > HiveThriftServer2. However, it applies hiveconf and hivevar on every SQL statement in class > SparkSQLOperationManager, which I think is not appropriate. So I made a small change to set > --hiveconf and --hivevar in class SparkSQLSessionManager instead; they are then applied only > once when the HiveServer2 session is opened, rather than being re-initialized for every SQL > statement. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29031) Materialized column to accelerate queries
[ https://issues.apache.org/jira/browse/SPARK-29031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Guo updated SPARK-29031: -- Description: Goals * Add a new SQL grammar of Materialized column * Implicitly rewrite SQL queries on the complex type of columns if there is a materialized columns for it * If the data type of the materialized columns is atomic type, even though the origin column type is in complex type, enable vectorized read and filter pushdown to improve performance Example Create a normal table {quote}CREATE TABLE x ( name STRING, age INT, params STRING, event MAP ) USING parquet; {quote} Add materialized columns to an existing table {quote}ALTER TABLE x ADD COLUMNS ( new_age INT MATERIALIZED age + 1, city STRING MATERIALIZED get_json_object(params, '$.city'), label STRING MATERIALIZED event['label'] ); {quote} When issue a query as below {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label'] FROM x WHER event['label'] = 'newuser'; {quote} It's equivalent to {quote}SELECT name, new_age, city, label FROM x WHERE label = 'newuser'; {quote} The query performance improved dramatically because # The new query (after rewritten) will read the new column city (in string type) instead of read the whole map of params(in map string). Much lesser data are need to read # Vectorized read can be utilized in the new query and can not be used in the old one. Because vectorized read can only be enabled when all required columns are in atomic type # Filter can be pushdown. Only filters on atomic column can be pushdown. The original filter event['label'] = 'newuser' is on complex column, so it can not be pushdown. # The new query do not need to parse JSON any more. JSON parse is a CPU intensive operation which will impact performance dramatically was: Goals * Add a new SQL grammar of Materialized column * Implicitly rewrite SQL queries on the complex type of columns if there is a materialized columns for it * If the data type of the materialized columns is atomic type, even though the origin column type is in complex type, enable vectorized read and filter pushdown to improve performance Example Create a normal table {quote}CREATE TABLE x ( name STRING, age INT, params STRING, event MAP ) USING parquet; {quote} Add materialized columns to an existing table {quote}ALTER TABLE x ADD COLUMNS ( new_age INT MATERIALIZED age + 1, city STRING MATERIALIZED get_json_object(params, '$.city'), label STRING MATERIALIZED event['label'] ); {quote} When issue a query as below {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label'] FROM x WHER event['label'] = 'newuser'; {quote} It equals to {quote}SELECT name, new_age, city, label FROM x WHERE label = 'newuser'; {quote} The query performance improved dramatically because # The new query (after rewritten) will read the new column city (in string type) instead of read the whole map of params(in map string). Much lesser data are need to read # Vectorized read can be utilized in the new query and can not be used in the old one. Because vectorized read can only be enabled when all required columns are in atomic type # Filter can be pushdown. Only filters on atomic column can be pushdown. The original filter event['label'] = 'newuser' is on complex column, so it can not be pushdown. # The new query do not need to parse JSON any more. 
JSON parse is a CPU intensive operation which will impact performance dramatically > Materialized column to accelerate queries > - > > Key: SPARK-29031 > URL: https://issues.apache.org/jira/browse/SPARK-29031 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jason Guo >Priority: Major > Labels: SPIP > > Goals > * Add a new SQL grammar of Materialized column > * Implicitly rewrite SQL queries on the complex type of columns if there is > a materialized columns for it > * If the data type of the materialized columns is atomic type, even though > the origin column type is in complex type, enable vectorized read and filter > pushdown to improve performance > Example > Create a normal table > {quote}CREATE TABLE x ( > name STRING, > age INT, > params STRING, > event MAP > ) USING parquet; > {quote} > > Add materialized columns to an existing table > {quote}ALTER TABLE x ADD COLUMNS ( > new_age INT MATERIALIZED age + 1, > city STRING MATERIALIZED get_json_object(params, '$.city'), > label STRING MATERIALIZED event['label'] > ); > {quote} > > When issue a query as below > {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label'] > FROM
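In other words, the proposed rewrite is an expression-to-column substitution: whenever a query uses an expression for which a materialized column exists, the reference is replaced by the (atomic) materialized column. A toy sketch of that mapping outside of Catalyst, with a hypothetical expression-text lookup table (not the proposed implementation):

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, expr}

// Hypothetical mapping from materialized expression text to the column that stores it.
val materialized: Map[String, String] = Map(
  "age + 1"                           -> "new_age",
  "get_json_object(params, '$.city')" -> "city",
  "event['label']"                    -> "label"
)

// Toy substitution: read the materialized column when one exists, otherwise
// keep the original expression. Reading the atomic column is what enables
// vectorized reads and filter pushdown.
def rewrite(exprText: String): Column =
  materialized.get(exprText).map(col).getOrElse(expr(exprText))

// df.where(rewrite("event['label']") === "newuser")  // becomes a filter on the `label` column
{code}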
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926427#comment-16926427 ] Gabor Somogyi commented on SPARK-29027: --- Hmmm, based on the reactor summary you've provided I see downstream changes. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 24.586 > s] > [INFO] Spark Project Local DB . SUCCESS [ 5.456 > s] > [INFO] Spark Project Netw
[jira] [Created] (SPARK-29035) unpersist() ignoring cache/persist()
Jose Silva created SPARK-29035: -- Summary: unpersist() ignoring cache/persist() Key: SPARK-29035 URL: https://issues.apache.org/jira/browse/SPARK-29035 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.4.3 Environment: Amazon EMR - Spark 2.4.3 Reporter: Jose Silva Calling unpersist(), even though the DataFrame is not used anymore, removes all the InMemoryTableScan nodes from the DAG. Here's a simplified version of the code I'm using: df = spark.read(...).where(...).cache() df_a = union(df.select(...), df.select(...), df.select(...)) df_b = df.select(...) df_c = df.select(...) df_d = df.select(...) df.unpersist() join(df_a, df_b, df_c, df_d).write() I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with and without the unpersist() call. I call unpersist() in order to prevent OOM during the join. From what I understand, even though all the DataFrames come from df, unpersisting df after doing the selects shouldn't ignore the cache call, right? -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
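Because DataFrames are lazy, nothing has been computed yet at the point where unpersist() is called in the snippet above, so the cache is dropped before the join ever runs; moving the unpersist() after the action keeps the InMemoryTableScan in the executed plan. A sketch of that reordering (paths and column names are placeholders, assuming a spark-shell session with spark.implicits._ in scope):

{code:scala}
// Placeholder paths and columns; the point is only the ordering of unpersist().
val df = spark.read.parquet("/path/to/input").where($"status" === "ok").cache()

val dfA = df.select($"key", $"a")
val dfB = df.select($"key", $"b")

val result = dfA.join(dfB, Seq("key"))
result.write.parquet("/path/to/output")  // the action executes while the cache is still in the plan

df.unpersist()                           // release the cached blocks only afterwards
{code}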
[jira] [Updated] (SPARK-29033) Always use CreateNamedStructUnsafe, the UnsafeRow-based version of the CreateNamedStruct codepath
[ https://issues.apache.org/jira/browse/SPARK-29033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-29033: --- Summary: Always use CreateNamedStructUnsafe, the UnsafeRow-based version of the CreateNamedStruct codepath (was: Always use CreateNamedStructUnsafe codepath) > Always use CreateNamedStructUnsafe, the UnsafeRow-based version of the > CreateNamedStruct codepath > - > > Key: SPARK-29033 > URL: https://issues.apache.org/jira/browse/SPARK-29033 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 2.x has two separate implementations of the "create named struct" > expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. > The "unsafe" version was added in SPARK-9373 to support structs in > {{GenerateUnsafeProjection}}. These two expressions both extend the > {{CreateNameStructLike}} trait. > For Spark 3.0, I propose to always use the "unsafe" code path: this will > avoid object allocation / boxing inefficiencies in the "safe" codepath, which > is an especially big problem when generating Encoders for deeply-nested case > classes. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29015) Can not support "add jar" on JDK 11
[ https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29015: Description: How to reproduce: Case 1: {code:bash} export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true sbin/start-thriftserver.sh bin/beeline -u jdbc:hive2://localhost:1 {code} {noformat} 0: jdbc:hive2://localhost:1> add jar /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; INFO : Added [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] to class path INFO : Added resources: [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] +-+ | result | +-+ +-+ No rows selected (0.381 seconds) 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; +-+ | Result | +-+ +-+ No rows selected (0.613 seconds) 0: jdbc:hive2://localhost:1> select * from addJar; Error: Error running query: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe (state=,code=0) {noformat} was: How to reproduce: Case 1: {code:bash} export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true sbin/start-thriftserver.sh bin/beeline -u jdbc:hive2://localhost:1 {code} {noformat} 0: jdbc:hive2://localhost:1> add jar /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; INFO : Added [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] to class path INFO : Added resources: [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] +-+ | result | +-+ +-+ No rows selected (0.381 seconds) 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; +-+ | Result | +-+ +-+ No rows selected (0.613 seconds) 0: jdbc:hive2://localhost:1> select * from addJar; Error: Error running query: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe (state=,code=0) {noformat} Case 2: {noformat} spark-sql> add jar /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; ADD JAR /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; spark-sql> select * from addJar; 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar] java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79) at org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123) at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101) at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98) at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110) at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105) at org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188) at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488) at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408) at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52) at org.apache.spark
[jira] [Commented] (SPARK-29015) Can not support "add jar" on JDK 11
[ https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926502#comment-16926502 ] Yuming Wang commented on SPARK-29015: - Moved {{Case 2}} to SPARK-29022. It's another issue. > Can not support "add jar" on JDK 11 > --- > > Key: SPARK-29015 > URL: https://issues.apache.org/jira/browse/SPARK-29015 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:bash} > export JAVA_HOME=/usr/lib/jdk-11.0.3 > export PATH=$JAVA_HOME/bin:$PATH > build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver > export SPARK_PREPEND_CLASSES=true > sbin/start-thriftserver.sh > bin/beeline -u jdbc:hive2://localhost:1 > {code} > {noformat} > 0: jdbc:hive2://localhost:1> add jar > /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; > INFO : Added > [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] > to class path > INFO : Added resources: > [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] > +-+ > | result | > +-+ > +-+ > No rows selected (0.381 seconds) > 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT > SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; > +-+ > | Result | > +-+ > +-+ > No rows selected (0.613 seconds) > 0: jdbc:hive2://localhost:1> select * from addJar; > Error: Error running query: java.lang.RuntimeException: > java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe > (state=,code=0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29015) Can not support "add jar" on JDK 11
[ https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29015: Description: How to reproduce: {code:bash} export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true sbin/start-thriftserver.sh bin/beeline -u jdbc:hive2://localhost:1 {code} {noformat} 0: jdbc:hive2://localhost:1> add jar /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; INFO : Added [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] to class path INFO : Added resources: [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] +-+ | result | +-+ +-+ No rows selected (0.381 seconds) 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; +-+ | Result | +-+ +-+ No rows selected (0.613 seconds) 0: jdbc:hive2://localhost:1> select * from addJar; Error: Error running query: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe (state=,code=0) {noformat} was: How to reproduce: Case 1: {code:bash} export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true sbin/start-thriftserver.sh bin/beeline -u jdbc:hive2://localhost:1 {code} {noformat} 0: jdbc:hive2://localhost:1> add jar /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; INFO : Added [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] to class path INFO : Added resources: [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] +-+ | result | +-+ +-+ No rows selected (0.381 seconds) 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; +-+ | Result | +-+ +-+ No rows selected (0.613 seconds) 0: jdbc:hive2://localhost:1> select * from addJar; Error: Error running query: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe (state=,code=0) {noformat} > Can not support "add jar" on JDK 11 > --- > > Key: SPARK-29015 > URL: https://issues.apache.org/jira/browse/SPARK-29015 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:bash} > export JAVA_HOME=/usr/lib/jdk-11.0.3 > export PATH=$JAVA_HOME/bin:$PATH > build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver > export SPARK_PREPEND_CLASSES=true > sbin/start-thriftserver.sh > bin/beeline -u jdbc:hive2://localhost:1 > {code} > {noformat} > 0: jdbc:hive2://localhost:1> add jar > /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; > INFO : Added > [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] > to class path > INFO : Added resources: > [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar] > +-+ > | result | > +-+ > +-+ > No rows selected (0.381 seconds) > 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT > SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe'; > +-+ > | Result | > +-+ > +-+ > No rows selected (0.613 seconds) > 0: jdbc:hive2://localhost:1> select * from addJar; > Error: Error running query: java.lang.RuntimeException: > java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe > (state=,code=0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
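For completeness, the same reproduction can also be driven through the SQL API from a Scala session; a sketch assuming a Hive-enabled build and the same local jar path as in the report:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the beeline reproduction above, expressed via spark.sql.
val spark = SparkSession.builder()
  .appName("add-jar-jdk11-repro")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("ADD JAR /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar")
spark.sql("CREATE TABLE addJar(key string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'")
spark.sql("SELECT * FROM addJar").show()  // fails with ClassNotFoundException on JDK 11 per this report
{code}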
[jira] [Issue Comment Deleted] (SPARK-29022) SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class
[ https://issues.apache.org/jira/browse/SPARK-29022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29022: -- Comment: was deleted (was: PR [https://github.com/apache/spark/pull/25729]) > SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class > > > Key: SPARK-29022 > URL: https://issues.apache.org/jira/browse/SPARK-29022 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Spark SQL CLI can't use class in jars add by SQL 'ADD JAR' > {code:java} > spark-sql> add jar > /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar; > ADD JAR > /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar > spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE > 'org.apache.hive.hcatalog.data.JsonSerDe'; > spark-sql> select * from addJar; > 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar] > java.lang.RuntimeException: java.lang.ClassNotFoundException: > org.apache.hive.hcatalog.data.JsonSerDe > at > org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408) > at > org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:367) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:920) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:999) > at org.ap
[jira] [Created] (SPARK-29036) SparkThriftServer may not be able to cancel a job if cancel is called before the job starts
angerszhu created SPARK-29036: - Summary: SparkThriftServer may not be able to cancel a job if cancel is called before the job starts Key: SPARK-29036 URL: https://issues.apache.org/jira/browse/SPARK-29036 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29036) SparkThriftServer may not be able to cancel a job if cancel is called before the job starts
[ https://issues.apache.org/jira/browse/SPARK-29036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29036: -- Description: Discussed in [https://github.com/apache/spark/pull/25611] > SparkThriftServer may not be able to cancel a job if cancel is called before the job starts > > > Key: SPARK-29036 > URL: https://issues.apache.org/jira/browse/SPARK-29036 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Discussed in [https://github.com/apache/spark/pull/25611] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
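The race described here (a cancel request arriving before the statement's job has actually started) is typically handled by recording the cancellation in shared state so that a later start can observe it. A generic sketch of that pattern, not tied to the actual thrift-server classes:

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.SparkContext

// Generic illustration: remember a cancel that arrives early, so the job is
// either cancelled once running or never started at all.
class CancellableStatement(sc: SparkContext, jobGroup: String) {
  private val cancelled = new AtomicBoolean(false)
  @volatile private var started = false

  def cancel(): Unit = {
    cancelled.set(true)                  // remembered even if run() has not been called yet
    if (started) sc.cancelJobGroup(jobGroup)
  }

  def run(body: => Unit): Unit = {
    if (cancelled.get()) return          // cancel arrived first: do not start the job
    sc.setJobGroup(jobGroup, "statement execution")
    started = true
    body
  }
}
{code}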
[jira] [Commented] (SPARK-29009) Returning pojo from udf not working
[ https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926552#comment-16926552 ] Tomasz Belina commented on SPARK-29009: --- I've dig a little dipper into source code and it looks like only Row and simple types are supported. I consider this issue as a bug because this peace of code: {code:java} Dataset test= spark.createDataFrame( Arrays.asList( new Movie("movie1",2323d,"1212"), new Movie("movie2",2323d,"1212"), new Movie("movie3",2323d,"1212"), new Movie("movie4",2323d,"1212")), Movie.class); {code} works perfectly well and it means that spark is perfectly able to handle pojos and convert it into Row in same cases. I was surprised that in case of udf conversion into Row is not applied automatically. Additionally documentation for udf is not very extensive so it quite hard distinguish what is a bug and what is a feature. Simple checking if given type of value returned by udf is supported or not would be very helpful. > Returning pojo from udf not working > --- > > Key: SPARK-29009 > URL: https://issues.apache.org/jira/browse/SPARK-29009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Tomasz Belina >Priority: Major > > It looks like spark is unable to construct row from pojo returned from udf. > Give POJO: > {code:java} > public class SegmentStub { > private int id; > private Date statusDateTime; > private int healthPointRatio; > } > {code} > Registration of the UDF: > {code:java} > public class ParseResultsUdf { > public String registerUdf(SparkSession sparkSession) { > Encoder encoder = Encoders.bean(SegmentStub.class); > final StructType schema = encoder.schema(); > sparkSession.udf().register(UDF_NAME, > (UDF2) (s, s2) -> new > SegmentStub(1, Date.valueOf(LocalDate.now()), 2), > schema > ); > return UDF_NAME; > } > } > {code} > Test code: > {code:java} > List strings = Arrays.asList(new String[]{"one", "two"},new > String[]{"3", "4"}); > JavaRDD rowJavaRDD = > sparkContext.parallelize(strings).map(RowFactory::create); > StructType schema = DataTypes > .createStructType(new StructField[] { > DataTypes.createStructField("foe1", DataTypes.StringType, false), > DataTypes.createStructField("foe2", > DataTypes.StringType, false) }); > Dataset dataFrame = > sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema); > Seq columnSeq = new Set.Set2<>(col("foe1"), > col("foe2")).toSeq(); > dataFrame.select(callUDF(udfName, columnSeq)).show(); > {code} > throws exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, > statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) > cannot be converted to struct > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396) > ... 21 more > } > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
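As the comment observes, the struct return type does work when the UDF returns a Row matching the declared schema instead of the bean itself. A sketch of that workaround in Scala against the Spark 2.4 API, reusing the SegmentStub field layout (the values and names here are illustrative):

{code:scala}
import java.sql.Date
import java.time.LocalDate
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

// Declare the struct schema explicitly and return a matching Row from the UDF.
val stubSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("statusDateTime", DateType, nullable = true),
  StructField("healthPointRatio", IntegerType, nullable = false)
))

val parseSegment = udf(
  (s: String, s2: String) => Row(1, Date.valueOf(LocalDate.now()), 2),
  stubSchema
)

// dataFrame.select(parseSegment(col("foe1"), col("foe2"))).show()
{code}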
[jira] [Commented] (SPARK-29009) Returning pojo from udf not working
[ https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926554#comment-16926554 ] Tomasz Belina commented on SPARK-29009: --- POJO is fine - I've just paste only part of the class and it works perfectly well in case of {{createDataFrame. BTW - automatic conversion from PJO to row is only partly supported in case of }}_{{createDataFrame}}{{.}}_ {{I've discovered this bug: }}{{SPARK-25654.}} > Returning pojo from udf not working > --- > > Key: SPARK-29009 > URL: https://issues.apache.org/jira/browse/SPARK-29009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Tomasz Belina >Priority: Major > > It looks like spark is unable to construct row from pojo returned from udf. > Give POJO: > {code:java} > public class SegmentStub { > private int id; > private Date statusDateTime; > private int healthPointRatio; > } > {code} > Registration of the UDF: > {code:java} > public class ParseResultsUdf { > public String registerUdf(SparkSession sparkSession) { > Encoder encoder = Encoders.bean(SegmentStub.class); > final StructType schema = encoder.schema(); > sparkSession.udf().register(UDF_NAME, > (UDF2) (s, s2) -> new > SegmentStub(1, Date.valueOf(LocalDate.now()), 2), > schema > ); > return UDF_NAME; > } > } > {code} > Test code: > {code:java} > List strings = Arrays.asList(new String[]{"one", "two"},new > String[]{"3", "4"}); > JavaRDD rowJavaRDD = > sparkContext.parallelize(strings).map(RowFactory::create); > StructType schema = DataTypes > .createStructType(new StructField[] { > DataTypes.createStructField("foe1", DataTypes.StringType, false), > DataTypes.createStructField("foe2", > DataTypes.StringType, false) }); > Dataset dataFrame = > sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema); > Seq columnSeq = new Set.Set2<>(col("foe1"), > col("foe2")).toSeq(); > dataFrame.select(callUDF(udfName, columnSeq)).show(); > {code} > throws exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, > statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) > cannot be converted to struct > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396) > ... 21 more > } > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29009) Returning pojo from udf not working
[ https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926554#comment-16926554 ] Tomasz Belina edited comment on SPARK-29009 at 9/10/19 12:03 PM: - POJO is fine - I've just paste only part of the class and it works perfectly well in case of {{createDataFrame. BTW - automatic conversion from PJO to row is only partly supported in case of _createDataFrame_}}_{{.}}_ {{I've discovered this bug: SPARK-25654.}} was (Author: tomasz.belina): POJO is fine - I've just paste only part of the class and it works perfectly well in case of {{createDataFrame. BTW - automatic conversion from PJO to row is only partly supported in case of }}_{{createDataFrame}}{{.}}_ {{I've discovered this bug: }}{{SPARK-25654.}} > Returning pojo from udf not working > --- > > Key: SPARK-29009 > URL: https://issues.apache.org/jira/browse/SPARK-29009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Tomasz Belina >Priority: Major > > It looks like spark is unable to construct row from pojo returned from udf. > Give POJO: > {code:java} > public class SegmentStub { > private int id; > private Date statusDateTime; > private int healthPointRatio; > } > {code} > Registration of the UDF: > {code:java} > public class ParseResultsUdf { > public String registerUdf(SparkSession sparkSession) { > Encoder encoder = Encoders.bean(SegmentStub.class); > final StructType schema = encoder.schema(); > sparkSession.udf().register(UDF_NAME, > (UDF2) (s, s2) -> new > SegmentStub(1, Date.valueOf(LocalDate.now()), 2), > schema > ); > return UDF_NAME; > } > } > {code} > Test code: > {code:java} > List strings = Arrays.asList(new String[]{"one", "two"},new > String[]{"3", "4"}); > JavaRDD rowJavaRDD = > sparkContext.parallelize(strings).map(RowFactory::create); > StructType schema = DataTypes > .createStructType(new StructField[] { > DataTypes.createStructField("foe1", DataTypes.StringType, false), > DataTypes.createStructField("foe2", > DataTypes.StringType, false) }); > Dataset dataFrame = > sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema); > Seq columnSeq = new Set.Set2<>(col("foe1"), > col("foe2")).toSeq(); > dataFrame.select(callUDF(udfName, columnSeq)).show(); > {code} > throws exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, > statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) > cannot be converted to struct > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396) > ... 21 more > } > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926563#comment-16926563 ] koert kuipers commented on SPARK-29027: --- hey the command i run is: mvn clean test -fae i am not aware of downstream changes. where/how do you see that in reactor summary? in so far i know this is spark master. to be sure i will do new clone of repo. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch ... S
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926577#comment-16926577 ] koert kuipers commented on SPARK-29027: --- i am running test on my work laptop. it has kerberos client installed (e.g. i can kinit, klist, kdestroy on it). i get the same error on other laptop (ubuntu 18) and one of our build servers. they also have kerberos client installed. i tried temporarily renaming /etc/krb5.conf to something else and then the tests passed it seems. so now i suspect that a functioning kerberos client interferes with test. i will repeat the confirm this is not coincidence. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-S
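If a locally installed Kerberos client is indeed what interferes with the suite's embedded KDC, a less invasive check than renaming /etc/krb5.conf is to point the test JVM at a throwaway configuration file via the standard JDK property. This is only a sketch of that idea with a made-up path, not a verified fix for the suite:
{code:java}
public class Krb5ConfIsolation {
    public static void main(String[] args) {
        // Equivalent to launching the JVM with -Djava.security.krb5.conf=/tmp/test-krb5.conf.
        // The JDK Kerberos code then reads that file instead of the host's /etc/krb5.conf;
        // it has to be in place before the Kerberos classes first load their configuration.
        System.setProperty("java.security.krb5.conf", "/tmp/test-krb5.conf");
        System.out.println("krb5 config taken from: " + System.getProperty("java.security.krb5.conf"));
    }
}
{code}
Depending on how the tests are launched, the property would have to reach the forked test JVM (for example through the build's test JVM arguments).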
[jira] [Created] (SPARK-29037) [Core] Spark may duplicate results when an application aborted and rerun
feiwang created SPARK-29037: --- Summary: [Core] Spark may duplicate results when an application aborted and rerun Key: SPARK-29037 URL: https://issues.apache.org/jira/browse/SPARK-29037 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29009) Returning pojo from udf not working
[ https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926592#comment-16926592 ] Hyukjin Kwon commented on SPARK-29009: -- Can you copy and paste a minimised version of the class to prevent such confusion? > Returning pojo from udf not working > --- > > Key: SPARK-29009 > URL: https://issues.apache.org/jira/browse/SPARK-29009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Tomasz Belina >Priority: Major > > It looks like spark is unable to construct row from pojo returned from udf. > Give POJO: > {code:java} > public class SegmentStub { > private int id; > private Date statusDateTime; > private int healthPointRatio; > } > {code} > Registration of the UDF: > {code:java} > public class ParseResultsUdf { > public String registerUdf(SparkSession sparkSession) { > Encoder encoder = Encoders.bean(SegmentStub.class); > final StructType schema = encoder.schema(); > sparkSession.udf().register(UDF_NAME, > (UDF2) (s, s2) -> new > SegmentStub(1, Date.valueOf(LocalDate.now()), 2), > schema > ); > return UDF_NAME; > } > } > {code} > Test code: > {code:java} > List strings = Arrays.asList(new String[]{"one", "two"},new > String[]{"3", "4"}); > JavaRDD rowJavaRDD = > sparkContext.parallelize(strings).map(RowFactory::create); > StructType schema = DataTypes > .createStructType(new StructField[] { > DataTypes.createStructField("foe1", DataTypes.StringType, false), > DataTypes.createStructField("foe2", > DataTypes.StringType, false) }); > Dataset dataFrame = > sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema); > Seq columnSeq = new Set.Set2<>(col("foe1"), > col("foe2")).toSeq(); > dataFrame.select(callUDF(udfName, columnSeq)).show(); > {code} > throws exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, > statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) > cannot be converted to struct > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396) > ... 21 more > } > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Summary: [Core] Spark gives duplicate result when an application was killed and rerun (was: [Core] Spark gives duplicate result when an application aborted and rerun) > [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: feiwang >Priority: Major > > Case: > A spark application was be killed due to long-running. > Then we re-run this application, we find that spark gives duplicated result. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29037) [Core] Spark may duplicate results when an application aborted and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Description: Case: A spark application was be killed due to long-running. Then we re-run this application, we find that spark gives duplicated result. > [Core] Spark may duplicate results when an application aborted and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: feiwang >Priority: Major > > Case: > A spark application was be killed due to long-running. > Then we re-run this application, we find that spark gives duplicated result. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application aborted and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Summary: [Core] Spark gives duplicate result when an application aborted and rerun (was: [Core] Spark may duplicate results when an application aborted and rerun) > [Core] Spark gives duplicate result when an application aborted and rerun > - > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: feiwang >Priority: Major > > Case: > A spark application was be killed due to long-running. > Then we re-run this application, we find that spark gives duplicated result. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926577#comment-16926577 ] koert kuipers edited comment on SPARK-29027 at 9/10/19 1:02 PM: i am running test on my work laptop. it has kerberos client installed (e.g. i can kinit, klist, kdestroy on it). i get the same error on other laptop (ubuntu 18) and one of our build servers. they also have kerberos client installed. was (Author: koert): i am running test on my work laptop. it has kerberos client installed (e.g. i can kinit, klist, kdestroy on it). i get the same error on other laptop (ubuntu 18) and one of our build servers. they also have kerberos client installed. i tried temporarily renaming /etc/krb5.conf to something else and then the tests passed it seems. so now i suspect that a functioning kerberos client interferes with test. i will repeat the confirm this is not coincidence. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} >
[jira] [Created] (SPARK-29038) SPIP: Support Spark Materialized View
Lantao Jin created SPARK-29038: -- Summary: SPIP: Support Spark Materialized View Key: SPARK-29038 URL: https://issues.apache.org/jira/browse/SPARK-29038 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin Materialized view is an important approach in DBMS to cache data to accelerate queries. By creating a materialized view through SQL, the data that can be cached is very flexible, and needs to be configured arbitrarily according to specific usage scenarios. The Materialization Manager automatically updates the cache data according to changes in detail source tables, simplifying user work. When user submit query, Spark optimizer rewrites the execution plan based on the available materialized view to determine the optimal execution plan. Details in [design doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926620#comment-16926620 ] koert kuipers commented on SPARK-29027: --- i am going to try running tests on a virtual machine to try to isolate what the issue could be in environment > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 24.586 > s] > [INFO] Spark Project Local DB . SUCCESS [ 5.456
[jira] [Resolved] (SPARK-28856) DataSourceV2: Support SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-28856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28856. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25601 [https://github.com/apache/spark/pull/25601] > DataSourceV2: Support SHOW DATABASES > > > Key: SPARK-28856 > URL: https://issues.apache.org/jira/browse/SPARK-28856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > SHOW DATABASES needs to support v2 catalogs. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28856) DataSourceV2: Support SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-28856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28856: --- Assignee: Terry Kim > DataSourceV2: Support SHOW DATABASES > > > Key: SPARK-28856 > URL: https://issues.apache.org/jira/browse/SPARK-28856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > SHOW DATABASES needs to support v2 catalogs. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926648#comment-16926648 ] Gabor Somogyi commented on SPARK-29027: --- {quote}where/how do you see that in reactor summary?{quote} I thought I've seen additional project in the summary but revisited and it's not true. I've double checked my Mac and there I've also kerberos client installed. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch
[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926420#comment-16926420 ] Gabor Somogyi edited comment on SPARK-29027 at 9/10/19 1:38 PM: [~kabhwan] thanks for pinging. I know of this because I've suggested on the original PR to open this jira. Apart from jenkins runs (which are passing) yesterday I've started this test in a loop with sbt and maven as well but until now haven't failed. What I can think of: * The environment is significantly different from my Mac and from PR builder * The code is not vanilla Spark and has some downstream changes All in all as suggested exact environment description + debug logs would help. was (Author: gsomogyi): [~kabhwan] thanks for pinging. I know of this because I've suggested on the original PR to open this jira. Apart from jenkins runs (which are passing) yesterday I've started this test in a loop with sbt and maven as well but until now haven't failed. What I can think of: * The environment is significantly different from my MAC and from PR builder * The code is not vanilla Spark and has some downstream changes All in all as suggested exact environment description + debug logs would help. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at 
sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.s
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926650#comment-16926650 ] Marco Gaido commented on SPARK-29038: - [~cltlfcjin] currently spark has a something similar, which is query caching, where the user can also select the level of caching performed. My undersatanding is that your proposal is to do something very similar, just with a different syntax, more DB oriented. Is my understanding correct? > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
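For reference, the existing caching mentioned in the comment above can be driven either from SQL or from the Dataset API, with an explicit storage level. A minimal sketch (the view name and query are made up):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class QueryCachingSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate();
        spark.range(1000).toDF("id").createOrReplaceTempView("events");

        // SQL-level caching of a temporary view (eagerly materialized by default).
        spark.sql("CACHE TABLE events");

        // Dataset-level caching of a derived result with a user-chosen storage level.
        Dataset<Row> agg = spark.sql("SELECT id % 10 AS bucket, COUNT(*) AS cnt FROM events GROUP BY id % 10");
        agg.persist(StorageLevel.MEMORY_AND_DISK());
        agg.show();

        spark.stop();
    }
}
{code}
As described in the SPIP, the proposal goes further than this: the materialization manager would keep the cached data up to date as source tables change, and the optimizer would rewrite incoming queries against the materialized view automatically.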
[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926650#comment-16926650 ] Marco Gaido edited comment on SPARK-29038 at 9/10/19 1:40 PM: -- [~cltlfcjin] currently spark has a something similar, which is query caching, where the user can also select the level of caching performed. My understanding is that your proposal is to do something very similar, just with a different syntax, more DB oriented. Is my understanding correct? was (Author: mgaido): [~cltlfcjin] currently spark has a something similar, which is query caching, where the user can also select the level of caching performed. My undersatanding is that your proposal is to do something very similar, just with a different syntax, more DB oriented. Is my understanding correct? > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926658#comment-16926658 ] angerszhu commented on SPARK-29038: --- I am working on a similar framework. It can trigger caching of sub-query data for a sql query when certain conditions are satisfied, and when a new sql query comes in, it can check the LogicalPlan; if it has a matching part, it rewrites the LogicalPlan to use the cached data. Currently it supports caching data in memory and in Alluxio. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Description: For a stage whose tasks commit output, each task first saves its output to a staging dir; when all tasks of this stage succeed, all task output under the staging dir is moved to the destination dir. However, when we kill an application that is committing tasks' output, parts of the tasks' results are left in the staging dir and are not cleared gracefully. Then we rerun this application, and the new application reuses this staging dir. When the task commit stage of the new application succeeds, all task output under this staging dir, which contains parts of the old application's task output, is moved to the destination dir and the result is duplicated. A more general case: it is also confusing that several applications running against the same root path simultaneously will share the same staging dir for the same jobId. was: Case: A spark application was be killed due to long-running. Then we re-run this application, we find that spark gives duplicated result. > [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: feiwang >Priority: Major > > For a stage whose tasks commit output, each task first saves its output to a > staging dir; when all tasks of this stage succeed, all task output under the > staging dir is moved to the destination dir. > However, when we kill an application that is committing tasks' output, parts > of the tasks' results are left in the staging dir and are not cleared > gracefully. > Then we rerun this application, and the new application reuses this staging > dir. > When the task commit stage of the new application succeeds, all task output > under this staging dir, which contains parts of the old application's task > output, is moved to the destination dir and the result is duplicated. > A more general case: it is also confusing that several applications running > against the same root path simultaneously will share the same staging dir for > the same jobId. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
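One plausible reading of the staging-dir sharing described above is the layout of the default Hadoop {{FileOutputCommitter}}, which stages task files under {{<destination>/_temporary/<appAttemptId>}}, where the app attempt id defaults to 0 and does not identify the Spark application. The sketch below only prints that work path for a made-up destination and task attempt to illustrate the layout; it is not taken from the reporter's job:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class StagingDirSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path destination = new Path("/tmp/warehouse/some_table");  // hypothetical destination dir
        TaskAttemptID attempt = TaskAttemptID.forName("attempt_200707121733_0001_m_000000_0");
        FileOutputCommitter committer =
            new FileOutputCommitter(destination, new TaskAttemptContextImpl(conf, attempt));

        // Prints .../some_table/_temporary/0/_temporary/attempt_200707121733_0001_m_000000_0.
        // The "_temporary/0" segment does not encode the application id, so a rerun writing to
        // the same destination stages its task files next to whatever a killed run left behind.
        System.out.println(committer.getWorkPath());
    }
}
{code}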
[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Affects Version/s: (was: 2.3.1) 2.1.0 > [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: feiwang >Priority: Major > > For a stage whose tasks commit output, each task first saves its output to a > staging dir; when all tasks of this stage succeed, all task output under the > staging dir is moved to the destination dir. > However, when we kill an application that is committing tasks' output, parts > of the tasks' results are left in the staging dir and are not cleared > gracefully. > Then we rerun this application, and the new application reuses this staging > dir. > When the task commit stage of the new application succeeds, all task output > under this staging dir, which contains parts of the old application's task > output, is moved to the destination dir and the result is duplicated. > A more general case: it is also confusing that several applications running > against the same root path simultaneously will share the same staging dir for > the same jobId. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926707#comment-16926707 ] koert kuipers commented on SPARK-29027: --- i tried doing tests in a virtual machine and they pass so its something in my environment (or should u say in all our corporate laptops and servers) but i have no idea what it could be right now > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 24.586
[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926707#comment-16926707 ] koert kuipers edited comment on SPARK-29027 at 9/10/19 2:53 PM: i tried doing tests in a virtual machine and they pass so its something in my environment (or really in all our corporate laptops and servers) but i have no idea what it could be right now was (Author: koert): i tried doing tests in a virtual machine and they pass so its something in my environment (or should u say in all our corporate laptops and servers) but i have no idea what it could be right now > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSH
[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Description: For a stage, whose tasks commit output, a task saves output to a staging dir firstly, when all tasks of this stage success, all task output under staging dir will be moved to destination dir. However, when we kill an application, which is committing tasks' output, parts of tasks' results will be kept in staging dir, which would not be cleared gracefully. Then we rerun this application and the new application will reuse this staging dir. And when the task commit stage of new application success, all task output under this staging dir, which contains parts of old application's task output , would be moved to destination dir and the result is duplicated. was: For a stage, whose tasks commit output, a task saves output to a staging dir firstly, when all tasks of this stage success, all task output under staging dir will be moved to destination dir. However, when we kill an application, which is committing tasks' output, parts of tasks' results will be kept in staging dir, which would not be cleared gracefully. Then we rerun this application and the new application will reuse this staging dir. And when the task commit stage of new application success, all task output under this staging dir, which contains parts of old application's task output , would be moved to destination dir and the result is duplicated. More common case, I think it is confused that several application running with same root path simultaneously, they will have same staging dir for same jobId. > [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: feiwang >Priority: Major > > For a stage, whose tasks commit output, a task saves output to a staging dir > firstly, when all tasks of this stage success, all task output under staging > dir will be moved to destination dir. > However, when we kill an application, which is committing tasks' output, > parts of tasks' results will be kept in staging dir, which would not be > cleared gracefully. > Then we rerun this application and the new application will reuse this > staging dir. > And when the task commit stage of new application success, all task output > under this staging dir, which contains parts of old application's task output > , would be moved to destination dir and the result is duplicated. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
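A rough, purely illustrative sketch of the scenario described above (the path and data are invented, not from the report): both the killed run and the rerun write to the same destination, and with the default Hadoop commit protocol the task staging area typically lives under the destination's _temporary/ directory, so files abandoned by the killed application can be swept into the destination when the rerun's job commit succeeds.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rerun-demo").getOrCreate()
val df = spark.range(0, 1000000).toDF("id")

// Task output is first written under <dest>/_temporary/... and only moved into
// <dest> at job commit. If the first application is killed mid-commit, whatever
// it left in that staging area is not cleaned up; a rerun committing to the same
// destination can then move the stale task output along with its own, which is
// the duplication this issue describes.
df.write.mode("append").parquet("hdfs:///tmp/duplicate-result-demo")
{code}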
[jira] [Assigned] (SPARK-29028) Add links to IBM Cloud Object Storage connector in cloud-integration.md
[ https://issues.apache.org/jira/browse/SPARK-29028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-29028: - Assignee: Dilip Biswal > Add links to IBM Cloud Object Storage connector in cloud-integration.md > --- > > Key: SPARK-29028 > URL: https://issues.apache.org/jira/browse/SPARK-29028 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29028) Add links to IBM Cloud Object Storage connector in cloud-integration.md
[ https://issues.apache.org/jira/browse/SPARK-29028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-29028. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25737 [https://github.com/apache/spark/pull/25737] > Add links to IBM Cloud Object Storage connector in cloud-integration.md > --- > > Key: SPARK-29028 > URL: https://issues.apache.org/jira/browse/SPARK-29028 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29039) centralize the catalog and table lookup logic
Wenchen Fan created SPARK-29039: --- Summary: centralize the catalog and table lookup logic Key: SPARK-29039 URL: https://issues.apache.org/jira/browse/SPARK-29039 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28982) Support ThriftServer GetTypeInfoOperation for Spark's own type
[ https://issues.apache.org/jira/browse/SPARK-28982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-28982. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25694 [https://github.com/apache/spark/pull/25694] > Support ThriftServer GetTypeInfoOperation for Spark's own type > -- > > Key: SPARK-28982 > URL: https://issues.apache.org/jira/browse/SPARK-28982 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.0 > > > Currently {{!typeinfo}} returns INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME, > ARRAY, MAP, STRUCT, UNIONTYPE and USER_DEFINED, all of which Spark turns into > string. > Maybe we should make SparkGetTypeInfoOperation, to exclude types which we > don't support? -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28982) Support ThriftServer GetTypeInfoOperation for Spark's own type
[ https://issues.apache.org/jira/browse/SPARK-28982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-28982: --- Assignee: angerszhu > Support ThriftServer GetTypeInfoOperation for Spark's own type > -- > > Key: SPARK-28982 > URL: https://issues.apache.org/jira/browse/SPARK-28982 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Currently {{!typeinfo}} returns INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME, > ARRAY, MAP, STRUCT, UNIONTYPE and USER_DEFINED, all of which Spark turns into > string. > Maybe we should make SparkGetTypeInfoOperation, to exclude types which we > don't support? -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926809#comment-16926809 ] koert kuipers commented on SPARK-29027: --- [~gsomogyi] do you use any services that require open ports perhaps? i am thinking it could be firewall issue, or host to ip mapping? > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 24.586 > s] > [INFO] Spark Project Local DB ..
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926814#comment-16926814 ] Liang-Chi Hsieh commented on SPARK-28927: - Hi [~JerryHouse], do you use any non-deterministic operations when preparing your training dataset, like sample, filtering based on random number, etc.? > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > Attachments: image-2019-09-02-11-55-33-596.png > > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 > which was not stable for production environment. > Dataset capacity: ~12 billion ratings > Here is the our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, > y._2.toFloat))) > .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class > ALSData(user:Int, item:Int, rating:Float) extends
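To make the question above concrete, a small self-contained spark-shell sketch (data and names are invented here, not taken from the report) of the kind of non-deterministic preparation being asked about, and the usual way to pin it down:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in for the prepared (user, item, rating) data built in the snippet above.
val ratings = Seq((1, 10, 1.0f), (1, 11, 0.5f), (2, 10, 0.2f)).toDF("user", "item", "rating")

// An unseeded random filter is re-evaluated if a stage is recomputed, so a retry
// can keep a different subset of rows than the one ALS indexed earlier -- one way
// to end up with out-of-range indices and run-to-run metric drift.
val risky = ratings.filter(rand() < 0.5)

// Seeding helps, and persisting (or checkpointing / writing out) the sampled data
// before fitting ALS removes the dependence on recomputation entirely.
val pinned = ratings.filter(rand(42L) < 0.5).persist()
{code}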
[jira] [Created] (SPARK-29040) Support pyspark.createDataFrame from a pyarrow.Table
Bryan Cutler created SPARK-29040: Summary: Support pyspark.createDataFrame from a pyarrow.Table Key: SPARK-29040 URL: https://issues.apache.org/jira/browse/SPARK-29040 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Bryan Cutler PySpark {{createDataFrame}} currently supports creating a spark DataFrame from Pandas, using Arrow if enabled. This could be extended to accept a {{pyarrow.Table}} which has the added benefit of being able to efficiently use columns with nested struct types. It is possible to convert a pyarrow.Table with nested columns into a pandas.DataFrame, but the data becomes dictionaries, and is not a performant way to parallelize the data. Time/Date columns would need to be handled specially, since pyspark currently uses pandas to convert Arrow data of these types to the required Spark internal format. This follows from a mailing list discussion at http://apache-spark-user-list.1001560.n3.nabble.com/question-about-pyarrow-Table-to-pyspark-DataFrame-conversion-td36110.html -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
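For reference, a minimal sketch of the route the description calls out as non-performant today, going through pandas, assuming a local SparkSession and pyarrow available on the driver (the config name is the Spark 2.4 one); the proposal would let the Table be passed to createDataFrame directly.

{code:python}
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Arrow-based conversion of the intermediate pandas.DataFrame (Spark 2.4 config name).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# A small Arrow table; a nested struct column here would turn into a column of
# Python dicts after to_pandas(), which is the inefficiency the issue describes.
table = pa.Table.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Today's route: Table -> pandas -> Spark. The proposal is to accept `table` directly.
df = spark.createDataFrame(table.to_pandas())
df.show()
{code}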
[jira] [Updated] (SPARK-29029) PhysicalOperation.collectProjectsAndFilters should use AttributeMap while substituting aliases
[ https://issues.apache.org/jira/browse/SPARK-29029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikita Konda updated SPARK-29029: - Component/s: SQL > PhysicalOperation.collectProjectsAndFilters should use AttributeMap while > substituting aliases > -- > > Key: SPARK-29029 > URL: https://issues.apache.org/jira/browse/SPARK-29029 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.0 >Reporter: Nikita Konda >Priority: Major > > We have a specific use case where in we are trying insert a custom logical > operator in our logical plan to avoid some of the Spark’s optimization rules. > However, we remove this logical operator as part of custom optimization rule > before we send this to SparkStrategies. > However, we are hitting issue in the following scenario: > Analyzed plan: > {code:java} > [1] Project [userid#0] > +- [2] SubqueryAlias tmp6 >+- [3] Project [videoid#47L, avebitrate#2, userid#0] > +- [4] Filter NOT (videoid#47L = cast(30 as bigint)) > +- [5] SubqueryAlias tmp5 > +- [6] CustomBarrier >+- [7] Project [videoid#47L, avebitrate#2, userid#0] > +- [8] Filter (avebitrate#2 < 10) > +- [9] SubqueryAlias tmp3 > +- [10] Project [avebitrate#2, factorial(videoid#1) > AS videoid#47L, userid#0] >+- [11] SubqueryAlias tmp2 > +- [12] Project [userid#0, videoid#1, > avebitrate#2] > +- [13] SubqueryAlias tmp1 > +- [14] Project [userid#0, videoid#1, > avebitrate#2] >+- [15] SubqueryAlias views > +- [16] > Relation[userid#0,videoid#1,avebitrate#2] > {code} > > Optimized Plan: > {code:java} > [1] Project [userid#0] > +- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30)) >+- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0] > +- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10)) > +- [5] Relation[userid#0,videoid#1,avebitrate#2] > {code} > > When this plan is passed into *PhysicalOperation* in *DataSourceStrategy*, > the collectProjectsAndFilters collects filters as > List[[+AttributeReference("videoid#47L"), > AttributeReference("avebitrate#2")]+|#47L), > AttributeReference(avebitrate#2)]. However, at this stage the base relation > only has videoid#1 and hence it throws exception saying *key not found: > videoid#47L.* > On looking further, noticed that the alias map in > *PhysicalOperation.substitute* does have the entry with key *videoid#47L* -> > Aliases Map((videoid#47L, factorial(videoid#1))). However, the substitute > alias is not substituting the expression for alias videoid#47L because they > differ in qualifier parameter. > Attribute key in Alias: AttributeReference("videoid", LongType, nullable = > true)(ExprId(47, _), *"None"*) > Attribute in Filter condition: AttributeReference("videoid", LongType, > nullable = true)(ExprId(47, _), *"Some(tmp5)"*) > Both differ only in the qualifier, however for alias map if we use > AttributeMap instead of Map[Attribute, Expression], we can get rid of the > above issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
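A small spark-shell sketch of the fix being proposed, using the Spark 2.x constructor signature (where the qualifier is an Option[String], as in the attributes quoted above): a plain Map keyed on Attribute misses the lookup when only the qualifier differs, while AttributeMap keys on the ExprId and does not.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference, ExprId}
import org.apache.spark.sql.types.LongType

// The same attribute (same ExprId) seen with two different qualifiers, as in the plan above.
val keyInAliasMap   = AttributeReference("videoid", LongType)(exprId = ExprId(47), qualifier = None)
val attrInCondition = AttributeReference("videoid", LongType)(exprId = ExprId(47), qualifier = Some("tmp5"))

val plainMap = Map[AttributeReference, String](keyInAliasMap -> "factorial(videoid#1)")
val attrMap  = AttributeMap(Seq(keyInAliasMap -> "factorial(videoid#1)"))

plainMap.contains(attrInCondition) // false: full attribute equality includes the qualifier
attrMap.contains(attrInCondition)  // true: AttributeMap resolves keys by ExprId only
{code}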
[jira] [Commented] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses
[ https://issues.apache.org/jira/browse/SPARK-29014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926889#comment-16926889 ] Ryan Blue commented on SPARK-29014: --- [~cloud_fan], why does this require a major refactor? It would be best to keep the implementation of this as small as possible and not tie it to other work. > DataSourceV2: Clean up current, default, and session catalog uses > - > > Key: SPARK-29014 > URL: https://issues.apache.org/jira/browse/SPARK-29014 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Blocker > > Catalog tracking in DSv2 has evolved since the initial changes went in. We > need to make sure that handling is consistent across plans using the latest > rules: > * The _current_ catalog should be used when no catalog is specified > * The _default_ catalog is the catalog _current_ is initialized to > * If the _default_ catalog is not set, then it is the built-in Spark session > catalog, which will be called `spark_catalog` (This is the v2 session catalog) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python
[ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927074#comment-16927074 ] Junichi Koizumi commented on SPARK-28902: --- Since, versions aren't the main concern here should I create a PR ? > Spark ML Pipeline with nested Pipelines fails to load when saved from Python > > > Key: SPARK-28902 > URL: https://issues.apache.org/jira/browse/SPARK-28902 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Saif Addin >Priority: Minor > > Hi, this error is affecting a bunch of our nested use cases. > Saving a *PipelineModel* with one of its stages being another > *PipelineModel*, fails when loading it from Scala if it is saved in Python. > *Python side:* > > {code:java} > from pyspark.ml import Pipeline > from pyspark.ml.feature import Tokenizer > t = Tokenizer() > p = Pipeline().setStages([t]) > d = spark.createDataFrame([["Hello Peter Parker"]]) > pm = p.fit(d) > np = Pipeline().setStages([pm]) > npm = np.fit(d) > npm.write().save('./npm_test') > {code} > > > *Scala side:* > > {code:java} > scala> import org.apache.spark.ml.PipelineModel > scala> val pp = PipelineModel.load("./npm_test") > java.lang.IllegalArgumentException: requirement failed: Error loading > metadata: Expected class name org.apache.spark.ml.PipelineModel but found > class name pyspark.ml.pipeline.PipelineModel > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) > at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380) > at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332) > ... 50 elided > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python
[ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junichi Koizumi updated SPARK-28902: -- Comment: was deleted (was: Since, versions aren't the main concern here should I create a PR ? ) > Spark ML Pipeline with nested Pipelines fails to load when saved from Python > > > Key: SPARK-28902 > URL: https://issues.apache.org/jira/browse/SPARK-28902 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Saif Addin >Priority: Minor > > Hi, this error is affecting a bunch of our nested use cases. > Saving a *PipelineModel* with one of its stages being another > *PipelineModel*, fails when loading it from Scala if it is saved in Python. > *Python side:* > > {code:java} > from pyspark.ml import Pipeline > from pyspark.ml.feature import Tokenizer > t = Tokenizer() > p = Pipeline().setStages([t]) > d = spark.createDataFrame([["Hello Peter Parker"]]) > pm = p.fit(d) > np = Pipeline().setStages([pm]) > npm = np.fit(d) > npm.write().save('./npm_test') > {code} > > > *Scala side:* > > {code:java} > scala> import org.apache.spark.ml.PipelineModel > scala> val pp = PipelineModel.load("./npm_test") > java.lang.IllegalArgumentException: requirement failed: Error loading > metadata: Expected class name org.apache.spark.ml.PipelineModel but found > class name pyspark.ml.pipeline.PipelineModel > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) > at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380) > at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332) > ... 50 elided > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python
[ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927076#comment-16927076 ] Junichi Koizumi commented on SPARK-28902: --- Since versions aren't the main concern here should I create a PR ? > Spark ML Pipeline with nested Pipelines fails to load when saved from Python > > > Key: SPARK-28902 > URL: https://issues.apache.org/jira/browse/SPARK-28902 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Saif Addin >Priority: Minor > > Hi, this error is affecting a bunch of our nested use cases. > Saving a *PipelineModel* with one of its stages being another > *PipelineModel*, fails when loading it from Scala if it is saved in Python. > *Python side:* > > {code:java} > from pyspark.ml import Pipeline > from pyspark.ml.feature import Tokenizer > t = Tokenizer() > p = Pipeline().setStages([t]) > d = spark.createDataFrame([["Hello Peter Parker"]]) > pm = p.fit(d) > np = Pipeline().setStages([pm]) > npm = np.fit(d) > npm.write().save('./npm_test') > {code} > > > *Scala side:* > > {code:java} > scala> import org.apache.spark.ml.PipelineModel > scala> val pp = PipelineModel.load("./npm_test") > java.lang.IllegalArgumentException: requirement failed: Error loading > metadata: Expected class name org.apache.spark.ml.PipelineModel but found > class name pyspark.ml.pipeline.PipelineModel > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) > at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380) > at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332) > ... 50 elided > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python
[ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927077#comment-16927077 ] Saif Addin commented on SPARK-28902: Ah, here I thought you said you couldn't reproduce it. Gladly hoping to see this fixed :) > Spark ML Pipeline with nested Pipelines fails to load when saved from Python > > > Key: SPARK-28902 > URL: https://issues.apache.org/jira/browse/SPARK-28902 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Saif Addin >Priority: Minor > > Hi, this error is affecting a bunch of our nested use cases. > Saving a *PipelineModel* with one of its stages being another > *PipelineModel*, fails when loading it from Scala if it is saved in Python. > *Python side:* > > {code:java} > from pyspark.ml import Pipeline > from pyspark.ml.feature import Tokenizer > t = Tokenizer() > p = Pipeline().setStages([t]) > d = spark.createDataFrame([["Hello Peter Parker"]]) > pm = p.fit(d) > np = Pipeline().setStages([pm]) > npm = np.fit(d) > npm.write().save('./npm_test') > {code} > > > *Scala side:* > > {code:java} > scala> import org.apache.spark.ml.PipelineModel > scala> val pp = PipelineModel.load("./npm_test") > java.lang.IllegalArgumentException: requirement failed: Error loading > metadata: Expected class name org.apache.spark.ml.PipelineModel but found > class name pyspark.ml.pipeline.PipelineModel > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) > at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380) > at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332) > ... 50 elided > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29041) Allow createDataFrame to accept bytes as binary type
Hyukjin Kwon created SPARK-29041: Summary: Allow createDataFrame to accept bytes as binary type Key: SPARK-29041 URL: https://issues.apache.org/jira/browse/SPARK-29041 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4, 3.0.0 Reporter: Hyukjin Kwon ``` spark.createDataFrame([[b"abcd"]], "col binary") ``` simply fails. bytes should also be able to accepted as binary type -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29041) Allow createDataFrame to accept bytes as binary type
[ https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29041: - Description: {code} spark.createDataFrame([[b"abcd"]], "col binary") {code} simply fails. bytes should also be able to accepted as binary type was: ``` spark.createDataFrame([[b"abcd"]], "col binary") ``` simply fails. bytes should also be able to accepted as binary type > Allow createDataFrame to accept bytes as binary type > > > Key: SPARK-29041 > URL: https://issues.apache.org/jira/browse/SPARK-29041 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > spark.createDataFrame([[b"abcd"]], "col binary") > {code} > simply fails. bytes should also be able to accepted as binary type -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29041) Allow createDataFrame to accept bytes as binary type
[ https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29041: - Description: {code} spark.createDataFrame([[b"abcd"]], "col binary") {code} simply fails as below: {code} {code} bytes should also be able to accepted as binary type was: {code} spark.createDataFrame([[b"abcd"]], "col binary") {code} simply fails. bytes should also be able to accepted as binary type > Allow createDataFrame to accept bytes as binary type > > > Key: SPARK-29041 > URL: https://issues.apache.org/jira/browse/SPARK-29041 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > spark.createDataFrame([[b"abcd"]], "col binary") > {code} > simply fails as below: > {code} > {code} > bytes should also be able to accepted as binary type -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29041) Allow createDataFrame to accept bytes as binary type
[ https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29041: - Description: {code} spark.createDataFrame([[b"abcd"]], "col binary") {code} simply fails as below: in Python 3 {code} Traceback (most recent call last): File "", line 1, in File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal data = list(data) File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare verify_func(obj) File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct verifier(v) File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default verify_acceptable_types(obj) File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types % (dataType, obj, type(obj TypeError: field col: BinaryType can not accept object b'abcd' in type {code} in Python 2: {code} Traceback (most recent call last): File "", line 1, in File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal data = list(data) File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare verify_func(obj) File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct verifier(v) File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default verify_acceptable_types(obj) File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types % (dataType, obj, type(obj TypeError: field col: BinaryType can not accept object 'abcd' in type {code} {{bytes}} should also be able to accepted as binary type was: {code} spark.createDataFrame([[b"abcd"]], "col binary") {code} simply fails as below: {code} {code} bytes should also be able to accepted as binary type > Allow createDataFrame to accept bytes as binary type > > > Key: SPARK-29041 > URL: https://issues.apache.org/jira/browse/SPARK-29041 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > spark.createDataFrame([[b"abcd"]], "col binary") > {code} > simply fails as below: > in Python 3 > {code} > Traceback (most recent call last): > File "", line 1, in > File "/.../spark/python/pyspark/sql/session.py", line 787, in > createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "/.../spark/python/pyspark/sql/session.py", line 442, in > _createFromLocal > data = list(data) > File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare > verify_func(obj) > File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify > verify_value(obj) > File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct > verifier(v) > File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify > verify_value(obj) > File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default > 
verify_acceptable_types(obj) > File "/.../spark/python/pyspark/sql/types.py", line 1282, in > verify_acceptable_types > % (dataType, obj, type(obj > TypeError: field col: BinaryType can not accept object b'abcd' in type 'bytes'> > {code} > in Python 2: > {code} > Traceback (most recent call last): > File "", line 1, in > File "/.../spark/python/pyspark/sql/session.py", line 787, in > createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "/.../spark/python/pyspark/sql/session.py", line 442, in > _createFromLocal > data = list(data) > File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare > verify_func(obj) > File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify > verify_value(obj) > File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct > verifier(v) > File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify > verify_value(obj) > File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default > verify_acceptable_types(obj) >
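Until the verifier is relaxed, a minimal sketch of the workaround implied by the tracebacks above: on Spark 2.4.x and current master before this change, bytearray is already in BinaryType's accepted types, so wrapping the value works on both Python 2 and 3 (assuming a local SparkSession).

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# bytearray passes the BinaryType verifier today, while bytes does not.
df = spark.createDataFrame([[bytearray(b"abcd")]], "col binary")
df.show()
{code}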
[jira] [Updated] (SPARK-29001) Print better log when process of events becomes slow
[ https://issues.apache.org/jira/browse/SPARK-29001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang updated SPARK-29001: - Summary: Print better log when process of events becomes slow (was: Print event thread stack trace when EventQueue starts to drop events) > Print better log when process of events becomes slow > > > Key: SPARK-29001 > URL: https://issues.apache.org/jira/browse/SPARK-29001 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Minor > > We shall print event thread stack trace when EventQueue starts to drop > events, this help us find out what type of events is slow. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29001) Print better log when process of events becomes slow
[ https://issues.apache.org/jira/browse/SPARK-29001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang updated SPARK-29001: - Description: We shall print better log when process of events becomes slow, to help find out what type of events is slow. (was: We shall print event thread stack trace when EventQueue starts to drop events, this help us find out what type of events is slow.) > Print better log when process of events becomes slow > > > Key: SPARK-29001 > URL: https://issues.apache.org/jira/browse/SPARK-29001 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Minor > > We shall print better log when process of events becomes slow, to help find > out what type of events is slow. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29026) Improve error message when constructor in `ScalaReflection` isn't found
[ https://issues.apache.org/jira/browse/SPARK-29026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29026: Assignee: Mick Jermsurawong > Improve error message when constructor in `ScalaReflection` isn't found > - > > Key: SPARK-29026 > URL: https://issues.apache.org/jira/browse/SPARK-29026 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Mick Jermsurawong >Assignee: Mick Jermsurawong >Priority: Minor > > Currently, a method to get constructor parameters from a given type > `constructParams` in `ScalaReflection` will throw exception if the type has > no constructor > {code:java} > is not a term > scala.ScalaReflectionException: {code} > In the normal usage of ExpressionEncoder, this can happen if the type is > interface extending `scala.Product`. > Also, since this is a protected method, this could have been other arbitrary > types without constructor. > To reproduce the error, the following will fail when trying to get > {{Encoder[NoConstructorProductTrait]}} > {code:java} > trait NoConstructorProductTrait extends scala.Product {} {code} > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29026) Improve error message when constructor in `ScalaReflection` isn't found
[ https://issues.apache.org/jira/browse/SPARK-29026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29026. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25736 [https://github.com/apache/spark/pull/25736] > Improve error message when constructor in `ScalaReflection` isn't found > - > > Key: SPARK-29026 > URL: https://issues.apache.org/jira/browse/SPARK-29026 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Mick Jermsurawong >Assignee: Mick Jermsurawong >Priority: Minor > Fix For: 3.0.0 > > > Currently, a method to get constructor parameters from a given type > `constructParams` in `ScalaReflection` will throw exception if the type has > no constructor > {code:java} > is not a term > scala.ScalaReflectionException: {code} > In the normal usage of ExpressionEncoder, this can happen if the type is > interface extending `scala.Product`. > Also, since this is a protected method, this could have been other arbitrary > types without constructor. > To reproduce the error, the following will fail when trying to get > {{Encoder[NoConstructorProductTrait]}} > {code:java} > trait NoConstructorProductTrait extends scala.Product {} {code} > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
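A minimal spark-shell sketch of the reproduction described above; deriving an encoder is one way to hit ScalaReflection's constructor lookup at runtime.

{code:scala}
import org.apache.spark.sql.Encoders

// A Product type with no constructor, as in the issue description.
trait NoConstructorProductTrait extends scala.Product

// Encoder derivation walks ScalaReflection to collect constructor parameters;
// for a constructor-less type this used to surface as a bare
// "... is not a term" ScalaReflectionException rather than a message naming the type.
val enc = Encoders.product[NoConstructorProductTrait]
{code}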
[jira] [Resolved] (SPARK-28570) Shuffle Storage API: Use writer API in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-28570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-28570. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25304 [https://github.com/apache/spark/pull/25304] > Shuffle Storage API: Use writer API in UnsafeShuffleWriter > -- > > Key: SPARK-28570 > URL: https://issues.apache.org/jira/browse/SPARK-28570 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Major > Fix For: 3.0.0 > > > Use the APIs introduced in SPARK-28209 in the UnsafeShuffleWriter. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28570) Shuffle Storage API: Use writer API in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-28570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-28570: -- Assignee: Matt Cheah > Shuffle Storage API: Use writer API in UnsafeShuffleWriter > -- > > Key: SPARK-28570 > URL: https://issues.apache.org/jira/browse/SPARK-28570 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Major > > Use the APIs introduced in SPARK-28209 in the UnsafeShuffleWriter. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25157) Streaming of image files from directory
[ https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25157. -- Resolution: Duplicate > Streaming of image files from directory > --- > > Key: SPARK-25157 > URL: https://issues.apache.org/jira/browse/SPARK-25157 > Project: Spark > Issue Type: New Feature > Components: ML, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Amit Baghel >Priority: Major > > We are doing video analytics for video streams using Spark. At present there > is no direct way to stream video frames or image files to Spark and process > them using Structured Streaming and Dataset. We are using Kafka to stream > images and then doing processing at spark. We need a method in Spark to > stream images from directory. Currently *{{DataStreamReader}}* doesn't > support Image files. With the introduction of > *org.apache.spark.ml.image.ImageSchema* class, we think streaming > capabilities can be added for image files. It is fine if it won't support > some of the structured streaming features as it is a binary file. This method > could be similar to *mmlspark* *streamImages* method. > [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE
Liang-Chi Hsieh created SPARK-29042: --- Summary: Sampling-based RDD with unordered input should be INDETERMINATE Key: SPARK-29042 URL: https://issues.apache.org/jira/browse/SPARK-29042 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We have found and fixed the correctness issue when RDD output is INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is order-sensitive to its input, so a sampling-based RDD with unordered input should be INDETERMINATE. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
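An illustrative spark-shell sketch of the situation the issue targets (nothing here comes from the report itself): the parent RDD gives no ordering guarantee, and sample() chooses elements by position with a per-partition RNG, so a recomputed, reordered parent yields a different sample even with a fixed seed.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// repartition() shuffles without sorting, so the order of elements inside each
// partition is not guaranteed to be the same if the stage is recomputed.
val unordered = sc.parallelize(1 to 1000, 10).repartition(10)

// Even with a fixed seed, sample() keeps elements by their position within the
// partition, so a reordered recompute keeps different elements -- the case the
// issue wants marked as INDETERMINATE so retries of downstream stages stay correct.
val sampled = unordered.sample(withReplacement = false, fraction = 0.1, seed = 42L)
sampled.count()
{code}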
[jira] [Created] (SPARK-29043) [History Server] Only one replay thread of FsHistoryProvider works because of a straggler
feiwang created SPARK-29043: --- Summary: [History Server] Only one replay thread of FsHistoryProvider works because of a straggler Key: SPARK-29043 URL: https://issues.apache.org/jira/browse/SPARK-29043 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29038: -- Comment: was deleted (was: I am doing a similar framework. It can trigger cache sub-query data of sql when it satisfy some condition, and when new sql come, it can check LogicalPlan , if have same part, rewrite LogicalPlan to use cached data. Now it support cache data in memory and alluxio,.) > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927189#comment-16927189 ] Jungtaek Lim commented on SPARK-29027: -- [~koert] Please try to mv krb5.conf to other and run the test again. If it works, please find "EXAMPLE.COM" is defined as realm in krb5.conf. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 4.178 > s] > [INFO] Spark Project Tags . SUCCESS [ 9.373 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 24.586 > s] > [INFO] Spark Project Local DB ...
[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails
[ https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927189#comment-16927189 ] Jungtaek Lim edited comment on SPARK-29027 at 9/11/19 1:59 AM: --- [~koert] Please try to mv krb5.conf to other and run the test again. If it works, please find "EXAMPLE.COM" is defined as realm in krb5.conf, as MiniKdc seems to use it for default configuration. was (Author: kabhwan): [~koert] Please try to mv krb5.conf to other and run the test again. If it works, please find "EXAMPLE.COM" is defined as realm in krb5.conf. > KafkaDelegationTokenSuite fails > --- > > Key: SPARK-29027 > URL: https://issues.apache.org/jira/browse/SPARK-29027 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 > Environment: {code} > commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 > Author: Sean Owen > Date: Mon Sep 9 10:19:40 2019 -0500 > {code} > Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) >Reporter: koert kuipers >Priority: Minor > > i am seeing consistent failure of KafkaDelegationTokenSuite on master > {code} > JsonUtilsSuite: > - parsing partitions > - parsing partitionOffsets > KafkaDelegationTokenSuite: > javax.security.sasl.SaslException: Failure to initialize security context > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails)] > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125) > at > com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85) > at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118) > at > org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114) > at > org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48) > at > org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197) > at java.lang.Thread.run(Thread.java:748) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos credentails) > at > sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127) > at > sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193) > at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427) > at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62) > at > sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154) > at > com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108) > ... 
12 more > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** > org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure > at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947) > at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924) > at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157) > at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131) > at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93) > at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > ... > KafkaSourceOffsetSuite: > - comparison {"t":{"0":1}} <=> {"t":{"0":2}} > - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}} > - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}} > - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}} > - basic serialization - deserialization > - OffsetSeqLog serialization - deserialization > - read Spark 2.1.0 offset format > {code} > {code} > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent
[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29043: Attachment: screenshot-1.png > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29043: Description: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29043: Description: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of straggler. was: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of straggler. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29044) Resolved attribute(s) R#661751,residue#661752 missing from ipi#660814,residue#660731,exper_set#660827,R#660730,description#660815,sequence#660817,exper#660828,symbol#660
Kristine Senkane created SPARK-29044: Summary: Resolved attribute(s) R#661751,residue#661752 missing from ipi#660814,residue#660731,exper_set#660827,R#660730,description#660815,sequence#660817,exper#660828,symbol#660816 Key: SPARK-29044 URL: https://issues.apache.org/jira/browse/SPARK-29044 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.3 Reporter: Kristine Senkane {code:java} SELECT group_averages.* FROM group_averages NATURAL INNER JOIN ( SELECT MAX(R) AS max_R, ipi AS ipi, description AS description, symbol AS symbol, residue FROM group_averages GROUP BY ipi, description, symbol, residue ) AS all_rows_bigger_than_four WHERE all_rows_bigger_than_four.max_R >= 4.0 {code} causes, {code:java} --- Py4JJavaError Traceback (most recent call last) /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: /usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: Py4JJavaError: An error occurred while calling o21.sql. : org.apache.spark.sql.AnalysisException: Resolved attribute(s) R#661751,residue#661752 missing from ipi#660814,residue#660731,exper_set#660827,R#660730,description#660815,sequence#660817,exper#660828,symbol#660816 in operator !Project [ipi#660814, symbol#660816, description#660815, sequence#660817, R#661751, exper#660828, exper_set#660827, residue#661752]. Attribute(s) with the same name appear in the operation: R,residue. Please check if the right attribute(s) are used.;; Project [ipi#660546, description#660547, symbol#660548, residue#660731, group_description#660716, total_residues_detected#660809L, num_datasets#660810L, R#660811] +- Filter (max_R#661746 >= cast(4.0 as double)) +- Project [ipi#660546, description#660547, symbol#660548, residue#660731, group_description#660716, total_residues_detected#660809L, num_datasets#660810L, R#660811, max_R#661746] +- Join Inner, ipi#660546 = ipi#661747) && (description#660547 = description#661748)) && (symbol#660548 = symbol#661749)) && (residue#660731 = residue#661752)) :- SubqueryAlias `group_averages` : +- Filter (num_datasets#660810L > cast(1 as bigint)) : +- Aggregate [ipi#660546, description#660547, symbol#660548, residue#660731, exper_set#660559, group_description#660716, total_residues_detected#660809L], [ipi#660546, description#660547, symbol#660548, residue#660731, group_description#660716, total_residues_detected#660809L, count(R#660758) AS num_datasets#660810L, CASE WHEN (stddev_samp(R#660758) < (cast(0.6 as double) * avg(R#660758))) THEN avg(R#660758) ELSE CASE WHEN (min(R#660758) < cast(4 as double)) THEN min(R#660758) ELSE avg(R#660758) END END AS R#660811] :+- Project [ipi#660546, description#660547, symbol#660548, exper_set#660559, exper#660560, residue#660731, group_description#660716, R#660758, total_residues_detected#660809L] : +- Join Inner, (((ipi#660546 = ipi#660814) && (description#660547 = description#660815)) && (symbol#660548 = symbol#660816)) : :- SubqueryAlias `table_by_residue` : : +- Aggregate [ipi#660546, description#660547, symbol#660548, residue#660731, exper_set#660559, exper#660560, group_description#660716], [ipi#660546, description#660547, symbol#660548, exper_set#660559, exper#660560, residue#660731, group_description#660716, CASE WHEN (stddev_samp(R#660730) < (cast(0.6 as double) * avg(R#660730))) THEN avg(R#660730) 
ELSE CASE WHEN (min(R#660730) < cast(4 as double)) THEN min(R#660730) ELSE avg(R#660730) END END AS R#660758] : : +- Join Inner, (exper#660560 = Cimage link#660715) : ::- SubqueryAlias `table_by_peptide` : :: +- Project [ipi#660546, symbol#660548, description#660547, sequence#660549, R#660730, exper#660560, exper_set#660559, residue#660731] : :: +- Sort [ipi#660546 ASC NULLS FIRST], true : ::+- Aggregate [exper#660560, ipi#660546, ((instr(protein_sequence#660699, regexp_replace(sequence#660549, [.*-], )) + instr(sequence#660549, *)) - 3), symbol#660548, exper_set#660559, sp#660696, sequence#660549, charge#660551, description#660547], [ipi#660546, symbol#660548, description#660547, sequence#660549, avg(cast(IR#660553 as double)) AS R#660730, exper#660560, ex
[jira] [Created] (SPARK-29045) Test failed due to table already exists in SQLMetricsSuite
Lantao Jin created SPARK-29045: -- Summary: Test failed due to table already exists in SQLMetricsSuite Key: SPARK-29045 URL: https://issues.apache.org/jira/browse/SPARK-29045 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Lantao Jin In method [[SQLMetricsTestUtils.testMetricsDynamicPartition()]], there is a CREATE TABLE sentence without [[withTable]] block. It causes test failure if use the same table name in other unit tests. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29045) Test failed due to table already exists in SQLMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-29045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29045: --- Description: In method {{SQLMetricsTestUtils.testMetricsDynamicPartition()}}, there is a CREATE TABLE statement without a {{withTable}} block. It causes test failures if the same table name is used in other unit tests. (was: In method [[SQLMetricsTestUtils.testMetricsDynamicPartition()]], there is a CREATE TABLE sentence without [[withTable]] block. It causes test failure if use the same table name in other unit tests.) > Test failed due to table already exists in SQLMetricsSuite > -- > > Key: SPARK-29045 > URL: https://issues.apache.org/jira/browse/SPARK-29045 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Minor > > In method {{SQLMetricsTestUtils.testMetricsDynamicPartition()}}, there is a > CREATE TABLE statement without a {{withTable}} block. It causes test failures if > the same table name is used in other unit tests. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
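A minimal sketch of the fix being described, assuming the usual {{withTable}} helper from Spark's SQL test utilities; the table name, schema, and source table below are illustrative, not taken from the Spark source.
{code:scala}
// Wrap the CREATE TABLE in withTable so the table is dropped afterwards,
// even when the test body throws, avoiding "table already exists" failures
// in other tests that happen to reuse the same name.
withTable("metrics_dyn_part") {
  sql(
    """CREATE TABLE metrics_dyn_part (value STRING, key INT)
      |USING parquet PARTITIONED BY (key)""".stripMargin)
  sql("INSERT INTO TABLE metrics_dyn_part PARTITION (key) SELECT value, key FROM testData")
  // ... assertions on the collected SQL metrics would go here ...
}
{code}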
[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29043: Description: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of straggler. Let's check the code. https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 was: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of straggler. Let's check the code. https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-LL547 > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of straggler. > Let's check the code. > https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29043: Description: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of straggler. Let's check the code. https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-LL547 was: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of straggler. > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of straggler. > Let's check the code. > https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-LL547 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29043: Description: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of straggler. Let's check the code. https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 There is a synchronous operation for all replay tasks. was: As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for spark history server. However, there is only one replay thread work because of straggler. Let's check the code. https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of straggler. > Let's check the code. > https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 > There is a synchronous operation for all replay tasks. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927198#comment-16927198 ] feiwang commented on SPARK-29043: - I think we can change it to Asynchronous. > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of straggler. > Let's check the code. > https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 > There is a synchronous operation for all replay tasks. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927198#comment-16927198 ] feiwang edited comment on SPARK-29043 at 9/11/19 2:26 AM: -- I think it is better to replay logs asynchronously. was (Author: hzfeiwang): I think we can change it to Asynchronous. > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of straggler. > Let's check the code. > https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 > There is a synchronous operation for all replay tasks. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927205#comment-16927205 ] Jungtaek Lim commented on SPARK-29043: -- It's asynchronous for replaying logs: it's synchronous for waiting for replaying logs to be finished, but it wouldn't matter much. So you may want to post full thread dump to show what other threads are doing when one thread is running for replaying logs. If it really doesn't work concurrently, there might be some other place being locked. > [History Server]Only one replay thread of FsHistoryProvider work because of > straggler > - > > Key: SPARK-29043 > URL: https://issues.apache.org/jira/browse/SPARK-29043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for > spark history server. > However, there is only one replay thread work because of straggler. > Let's check the code. > https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547 > There is a synchronous operation for all replay tasks. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
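For reference, the pattern under discussion, submitting each log replay to a fixed-size pool and then waiting for the whole batch, looks roughly like the sketch below. This is a simplified illustration, not the actual FsHistoryProvider code; the paths and helper names are made up.
{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

val numReplayThreads = 30  // corresponds to spark.history.fs.numReplayThreads
val pool = Executors.newFixedThreadPool(numReplayThreads)

// Placeholder for parsing one event log and updating the application listing.
def replay(logPath: String): Unit = { /* ... */ }

val logsToReplay = Seq("app-1.inprogress", "app-2", "app-3")  // illustrative paths

// Each replay runs asynchronously on the pool...
val futures = logsToReplay.map { path =>
  pool.submit(new Runnable { override def run(): Unit = replay(path) })
}
// ...and only this wait for the whole batch is synchronous, so a single
// straggler delays the next scan round rather than the other replay threads.
futures.foreach(_.get())

pool.shutdown()
pool.awaitTermination(1, TimeUnit.MINUTES)
{code}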
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927214#comment-16927214 ] Lantao Jin commented on SPARK-29038: [~mgaido] IIUC, there is no "query caching" in Spark, even no result cache. But Spark natively supports RDD-level cache. Multiple jobs can share cached RDD. The cached RDD is closer to the calculation result and requires less computation. In addition, the file system level cache such as HDFS cache or Alluxio can also load data into memory in advance, improving data processing efficiency. But materialized view actually is a technology about summaries *precalculating*. Summaries are special types of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. The query optimizer transparently rewrites the request to use the materialized view. Queries go directly to the materialized view and not to the underlying detail tables which had been materialized to storage like HDFS. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927214#comment-16927214 ] Lantao Jin edited comment on SPARK-29038 at 9/11/19 3:24 AM: - [~mgaido] IIUC, there is no "query caching" in Spark, even no result cache. But Spark natively supports RDD-level cache. Multiple jobs can share cached RDD. The cached RDD is closer to the calculation result and requires less computation. In addition, the file system level cache such as HDFS cache or Alluxio can also load data into memory in advance, improving data processing efficiency. But materialized view actually is a technology about summaries *precalculating*. Summaries are special types of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. The query optimizer transparently rewrites the request to use the materialized view. Queries go directly to the materialized view which had been persisted in storage (e.g HDFS) and not to the underlying detail tables. was (Author: cltlfcjin): [~mgaido] IIUC, there is no "query caching" in Spark, even no result cache. But Spark natively supports RDD-level cache. Multiple jobs can share cached RDD. The cached RDD is closer to the calculation result and requires less computation. In addition, the file system level cache such as HDFS cache or Alluxio can also load data into memory in advance, improving data processing efficiency. But materialized view actually is a technology about summaries *precalculating*. Summaries are special types of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. The query optimizer transparently rewrites the request to use the materialized view. Queries go directly to the materialized view and not to the underlying detail tables which had been materialized to storage like HDFS. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
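To make the distinction concrete: precalculated summaries are persisted to durable storage and re-aggregated by later queries, whereas RDD/Dataset caching keeps an intermediate result in memory. A rough sketch of the idea in plain Spark SQL follows; the table names are made up and this is not the syntax proposed in the SPIP.
{code:scala}
// Emulate a "materialized view" by hand: precalculate an expensive join +
// aggregation once and persist it (here as a Parquet table).
spark.sql("""
  CREATE TABLE sales_by_item_mv USING parquet AS
  SELECT i.item_id, i.category, SUM(s.amount) AS total_amount, COUNT(*) AS cnt
  FROM sales s JOIN items i ON s.item_id = i.item_id
  GROUP BY i.item_id, i.category
""")

// With real materialized-view support, the optimizer would transparently rewrite
// a query over the detail tables to read sales_by_item_mv; here the rewrite is manual.
spark.sql("""
  SELECT category, SUM(total_amount) AS total_amount
  FROM sales_by_item_mv
  GROUP BY category
""").show()
{code}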
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927217#comment-16927217 ] angerszhu commented on SPARK-29038: --- [~cltlfcjin] *Precalculating, a little like CarbonData's DataMap.* *Have you implemented the whole matching logic?* > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927229#comment-16927229 ] Lantao Jin commented on SPARK-29038: [~angerszhuuu] By default, we use Parquet to store the data of a materialized view, but it supports all storage formats that Spark supports. We have implemented most of the matching logic for filter, join and aggregate. But it cannot cover all scenarios, like JoinBack, since Spark currently doesn't support PKs or dimensions like other DBMSs (e.g. Oracle). > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927243#comment-16927243 ] angerszhu commented on SPARK-29038: --- I am interested in the matching case where you create an MV table q1_mv with group by `l_returnflag, l_linestatus, l_shipdate` and your query groups by `l_returnflag, l_linestatus`. This may be the most complex part to get right. I wanted to do this in my cache framework, but I couldn't find a good way to do it. Can I contact you via WeChat? > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
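The case raised here, a query that groups by a subset of the view's grouping columns, can be answered by re-aggregating the finer-grained view, as long as every aggregate is decomposable. A hedged sketch of the rewritten query, assuming q1_mv stores SUM(l_quantity) AS sum_qty and COUNT(*) AS cnt per (l_returnflag, l_linestatus, l_shipdate) group (the stored column names are illustrative):
{code:scala}
// Roll the materialized view up to the coarser grouping. SUM and COUNT
// re-aggregate cleanly; an average must be recomputed as SUM(sum_qty) / SUM(cnt)
// rather than by averaging the per-group averages.
spark.sql("""
  SELECT l_returnflag, l_linestatus,
         SUM(sum_qty)            AS sum_qty,
         SUM(cnt)                AS cnt,
         SUM(sum_qty) / SUM(cnt) AS avg_qty
  FROM q1_mv
  GROUP BY l_returnflag, l_linestatus
""").show()
{code}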
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927256#comment-16927256 ] Lantao Jin commented on SPARK-29038: [~angerszhuuu]Of course, will contact you offline > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258 ] Dilip Biswal commented on SPARK-29038: -- [~cltlfcjin] Actually i had similar question as [~mgaido]. We have been writing the SQL reference for 3.0 have recently documented {code} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in SPARK, it is possible to cache the result of a complex query involving joins, aggregates etc. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258 ] Dilip Biswal edited comment on SPARK-29038 at 9/11/19 5:09 AM: --- [~cltlfcjin] Actually i had similar question as [~mgaido]. We have been writing the SQL reference for 3.0 and have recently documented {code:java} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in SPARK, it is possible to cache the result of a complex query involving joins, aggregates etc. was (Author: dkbiswal): [~cltlfcjin] Actually i had similar question as [~mgaido]. We have been writing the SQL reference for 3.0 and have recently documented {code} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in SPARK, it is possible to cache the result of a complex query involving joins, aggregates etc. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258 ] Dilip Biswal edited comment on SPARK-29038 at 9/11/19 5:09 AM: --- [~cltlfcjin] Actually i had similar question as [~mgaido]. We have been writing the SQL reference for 3.0 and have recently documented {code} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in SPARK, it is possible to cache the result of a complex query involving joins, aggregates etc. was (Author: dkbiswal): [~cltlfcjin] Actually i had similar question as [~mgaido]. We have been writing the SQL reference for 3.0 have recently documented {code} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in SPARK, it is possible to cache the result of a complex query involving joins, aggregates etc. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258 ] Dilip Biswal edited comment on SPARK-29038 at 9/11/19 5:13 AM: --- [~cltlfcjin] Actually i had similar question as [~mgaido]. We have been writing the SQL reference for 3.0 and have recently documented {code:java} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in SPARK, it is possible to cache the result of a complex query involving joins, aggregates etc, right ? was (Author: dkbiswal): [~cltlfcjin] Actually i had similar question as [~mgaido]. We have been writing the SQL reference for 3.0 and have recently documented {code:java} CACHE TABLE {code} in [https://github.com/apache/spark/pull/25532]. So in SPARK, it is possible to cache the result of a complex query involving joins, aggregates etc. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
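For context, the command referenced above does let the result of a complex query be cached under a name and reused by later statements; a small example of that usage (table and column names are illustrative):
{code:scala}
// CACHE TABLE ... AS caches the query result in memory under a given name.
// Unlike the proposed materialized views, the cache is not persisted to durable
// storage, and other queries over the detail tables are not rewritten to roll
// the cached result up.
spark.sql("""
  CACHE TABLE daily_totals AS
  SELECT o.order_date, SUM(o.amount) AS total
  FROM orders o JOIN customers c ON o.customer_id = c.id
  GROUP BY o.order_date
""")

// Later queries read the cached result directly by name.
spark.sql("SELECT * FROM daily_totals WHERE total > 1000").show()
{code}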