[jira] [Commented] (SPARK-29006) Support special date/timestamp values `infinity`/`-infinity`

2019-09-10 Thread Anurag Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926372#comment-16926372
 ] 

Anurag Sharma commented on SPARK-29006:
---

[~maxgekk] Thanks, will wait for your code to be merged. 

> Support special date/timestamp values `infinity`/`-infinity`
> 
>
> Key: SPARK-29006
> URL: https://issues.apache.org/jira/browse/SPARK-29006
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> ||Input String||Valid Types||Description||
> |{{infinity}}|{{date}}, {{timestamp}}|later than all other time stamps|
> |{{-infinity}}|{{date}}, {{timestamp}}|earlier than all other time stamps|
> https://www.postgresql.org/docs/12/datatype-datetime.html
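A hedged Scala sketch of the idea (my own illustration, not the patch referred to
above): the special strings could be mapped to sentinel microsecond values so that
`infinity` sorts after and `-infinity` sorts before every other timestamp.

{code:scala}
// Sketch only: Long.MaxValue/MinValue as sentinels are an assumption for illustration,
// not Spark's actual constants.
def specialTimestampToMicros(s: String): Option[Long] = s.trim.toLowerCase match {
  case "infinity"  => Some(Long.MaxValue)  // later than all other timestamps
  case "-infinity" => Some(Long.MinValue)  // earlier than all other timestamps
  case _           => None                 // not special; fall back to normal parsing
}

assert(specialTimestampToMicros(" Infinity ").contains(Long.MaxValue))
{code}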



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29024) Ignore case while resolving time zones

2019-09-10 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29024:
---
Summary: Ignore case while resolving time zones  (was: Support the `zulu` 
time zone)

> Ignore case while resolving time zones
> --
>
> Key: SPARK-29024
> URL: https://issues.apache.org/jira/browse/SPARK-29024
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The `zulu` time zone is used by 
> https://github.com/apache/spark/blob/67b4329fb08fd606461aa1ac9274c4a84d15d70e/sql/core/src/test/resources/sql-tests/inputs/pgSQL/timestamp.sql#L31
>  but `getZoneId` fails to resolve it:
> {code}
> scala> getZoneId("zulu")
> java.time.zone.ZoneRulesException: Unknown time-zone ID: zulu
>   at java.time.zone.ZoneRulesProvider.getProvider(ZoneRulesProvider.java:272)
>   at java.time.zone.ZoneRulesProvider.getRules(ZoneRulesProvider.java:227)
>   at java.time.ZoneRegion.ofId(ZoneRegion.java:120)
>   at java.time.ZoneId.of(ZoneId.java:411)
>   at java.time.ZoneId.of(ZoneId.java:359)
>   at java.time.ZoneId.of(ZoneId.java:315)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:77)
>   ... 49 elided
> {code}
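A hedged Scala sketch of one way to do the case-insensitive resolution (an
illustration, not the actual fix): match the input against the JDK's known zone IDs
ignoring case before calling `ZoneId.of`, so that `zulu` resolves to the registered
`Zulu` alias.

{code:scala}
import java.time.ZoneId
import scala.collection.JavaConverters._

// Sketch: find the canonical spelling of the zone ID ignoring case, then resolve it.
def getZoneIdIgnoreCase(id: String): ZoneId = {
  val canonical = ZoneId.getAvailableZoneIds.asScala
    .find(_.equalsIgnoreCase(id))
    .getOrElse(id)
  ZoneId.of(canonical, ZoneId.SHORT_IDS)
}

println(getZoneIdIgnoreCase("zulu"))  // Zulu
{code}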



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29032) Simplify Prometheus support by adding `PrometheusServlet`

2019-09-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29032:
-

 Summary: Simplify Prometheus support by adding `PrometheusServlet`
 Key: SPARK-29032
 URL: https://issues.apache.org/jira/browse/SPARK-29032
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue aims to simplify `Prometheus` support.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet

2019-09-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29032:
--
Summary: Simplify Prometheus support by adding PrometheusServlet  (was: 
Simplify Prometheus support by adding PrometheusServlet`)

> Simplify Prometheus support by adding PrometheusServlet
> ---
>
> Key: SPARK-29032
> URL: https://issues.apache.org/jira/browse/SPARK-29032
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to simplify `Prometheus` support.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet`

2019-09-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29032:
--
Summary: Simplify Prometheus support by adding PrometheusServlet`  (was: 
Simplify Prometheus support by adding `PrometheusServlet`)

> Simplify Prometheus support by adding PrometheusServlet`
> 
>
> Key: SPARK-29032
> URL: https://issues.apache.org/jira/browse/SPARK-29032
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to simplify `Prometheus` support.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet

2019-09-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29032:
--
Description: This issue aims to simplify `Prometheus` support in Spark 
standalone environment or K8s environment.  (was: This issue aims to simplify 
`Prometheus` support.)

> Simplify Prometheus support by adding PrometheusServlet
> ---
>
> Key: SPARK-29032
> URL: https://issues.apache.org/jira/browse/SPARK-29032
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to simplify `Prometheus` support in Spark standalone 
> environment or K8s environment.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29033) Always use CreateNamedStructUnsafe codepath

2019-09-10 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-29033:
--

 Summary: Always use CreateNamedStructUnsafe codepath
 Key: SPARK-29033
 URL: https://issues.apache.org/jira/browse/SPARK-29033
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark 2.x has two separate implementations of the "create named struct" 
expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. The 
"unsafe" version was added in SPARK-9373 to support structs in 
{{GenerateUnsafeProjection}}. These two expressions both extend the 
{{CreateNameStructLike}} trait.

For Spark 3.0, I propose to always use the "unsafe" code path: this will avoid 
object allocation / boxing inefficiencies in the "safe" codepath, which is an 
especially big problem when generating Encoders for deeply-nested structs.
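As an illustration of where the safe-path cost shows up (a hedged sketch with made-up
case classes, not code from the proposal), a deeply-nested case class produces one
"create named struct" node per nesting level in the encoder's serializer, and on the
safe codepath each level allocates boxed values:

{code:scala}
import org.apache.spark.sql.Encoders

// Hypothetical nested case classes, used only to show the nesting depth.
case class Inner(a: Int, b: Long)
case class Middle(x: Inner, y: Inner)
case class Outer(left: Middle, right: Middle)

// The encoder's serializer nests one "create named struct" expression per level.
val enc = Encoders.product[Outer]
println(enc.schema.treeString)
{code}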



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29033) Always use CreateNamedStructUnsafe codepath

2019-09-10 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29033:
---
Issue Type: Improvement  (was: Bug)

> Always use CreateNamedStructUnsafe codepath
> ---
>
> Key: SPARK-29033
> URL: https://issues.apache.org/jira/browse/SPARK-29033
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 2.x has two separate implementations of the "create named struct" 
> expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. 
> The "unsafe" version was added in SPARK-9373 to support structs in 
> {{GenerateUnsafeProjection}}. These two expressions both extend the 
> {{CreateNameStructLike}} trait.
> For Spark 3.0, I propose to always use the "unsafe" code path: this will 
> avoid object allocation / boxing inefficiencies in the "safe" codepath, which 
> is an especially big problem when generating Encoders for deeply-nested 
> structs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29033) Always use CreateNamedStructUnsafe codepath

2019-09-10 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29033:
---
Description: 
Spark 2.x has two separate implementations of the "create named struct" 
expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. The 
"unsafe" version was added in SPARK-9373 to support structs in 
{{GenerateUnsafeProjection}}. These two expressions both extend the 
{{CreateNameStructLike}} trait.

For Spark 3.0, I propose to always use the "unsafe" code path: this will avoid 
object allocation / boxing inefficiencies in the "safe" codepath, which is an 
especially big problem when generating Encoders for deeply-nested case classes.

  was:
Spark 2.x has two separate implementations of the "create named struct" 
expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. The 
"unsafe" version was added in SPARK-9373 to support structs in 
{{GenerateUnsafeProjection}}. These two expressions both extend the 
{{CreateNameStructLike}} trait.

For Spark 3.0, I propose to always use the "unsafe" code path: this will avoid 
object allocation / boxing inefficiencies in the "safe" codepath, which is an 
especially big problem when generating Encoders for deeply-nested structs.


> Always use CreateNamedStructUnsafe codepath
> ---
>
> Key: SPARK-29033
> URL: https://issues.apache.org/jira/browse/SPARK-29033
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 2.x has two separate implementations of the "create named struct" 
> expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. 
> The "unsafe" version was added in SPARK-9373 to support structs in 
> {{GenerateUnsafeProjection}}. These two expressions both extend the 
> {{CreateNameStructLike}} trait.
> For Spark 3.0, I propose to always use the "unsafe" code path: this will 
> avoid object allocation / boxing inefficiencies in the "safe" codepath, which 
> is an especially big problem when generating Encoders for deeply-nested case 
> classes.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29034) String Constants with C-style Escapes

2019-09-10 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29034:
---

 Summary: String Constants with C-style Escapes
 Key: SPARK-29034
 URL: https://issues.apache.org/jira/browse/SPARK-29034
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


PostgreSQL also accepts "escape" string constants, which are an extension to 
the SQL standard. An escape string constant is specified by writing the letter 
{{E}} (upper or lower case) just before the opening single quote, e.g., 
{{E'foo'}}. (When continuing an escape string constant across lines, write 
{{E}} only before the first opening quote.) Within an escape string, a 
backslash character ({{\}}) begins a C-like _backslash escape_ sequence, in 
which the combination of backslash and following character(s) represent a 
special byte value, as shown in [Table 
4-1|https://www.postgresql.org/docs/9.3/sql-syntax-lexical.html#SQL-BACKSLASH-TABLE].

*Table 4-1. Backslash Escape Sequences*
||Backslash Escape Sequence||Interpretation||
|{{\b}}|backspace|
|{{\f}}|form feed|
|{{\n}}|newline|
|{{\r}}|carriage return|
|{{\t}}|tab|
|{{\}}{{o}}, {{\}}{{oo}}, {{\}}{{ooo}} ({{o}} = 0 - 7)|octal byte value|
|{{\x}}{{h}}, {{\x}}{{hh}} ({{h}} = 0 - 9, A - F)|hexadecimal byte value|
|{{\u}}{{xxxx}}, {{\U}}{{xxxxxxxx}} ({{x}} = 0 - 9, A - F)|16 or 32-bit hexadecimal Unicode character value|

Any other character following a backslash is taken literally. Thus, to include 
a backslash character, write two backslashes ({{\\}}). Also, a single quote can 
be included in an escape string by writing {{\'}}, in addition to the normal 
way of {{''}}.

It is your responsibility that the byte sequences you create, especially when 
using the octal or hexadecimal escapes, compose valid characters in the server 
character set encoding. When the server encoding is UTF-8, then the Unicode 
escapes or the alternative Unicode escape syntax, explained in [Section 
4.1.2.3|https://www.postgresql.org/docs/9.3/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE],
 should be used instead. (The alternative would be doing the UTF-8 encoding by 
hand and writing out the bytes, which would be very cumbersome.)

The Unicode escape syntax works fully only when the server encoding is 
{{UTF8}}. When other server encodings are used, only code points in the ASCII 
range (up to {{\u007F}}) can be specified. Both the 4-digit and the 8-digit 
form can be used to specify UTF-16 surrogate pairs to compose characters with 
code points larger than U+FFFF, although the availability of the 8-digit form 
technically makes this unnecessary. (When surrogate pairs are used when the 
server encoding is {{UTF8}}, they are first combined into a single code point 
that is then encoded in UTF-8.)
 
 
[https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-BACKSLASH-TABLE]
 
Example:
{code:sql}
postgres=# SET bytea_output TO escape;
SET
postgres=# SELECT E'Th\\000omas'::bytea;
   bytea

 Th\000omas
(1 row)

postgres=# SELECT 'Th\\000omas'::bytea;
bytea
-
 Th\\000omas
(1 row)
{code}
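A hedged Scala sketch of decoding the backslash escapes described above (my own
illustration, not a proposed Spark parser change); the \u/\U forms are omitted to keep
it short:

{code:scala}
// Decode \b \f \n \r \t, octal (\o, \oo, \ooo) and hex (\xh, \xhh) escapes; any other
// character after a backslash (including \' and \\) is taken literally, per the table.
def unescapeCStyle(s: String): String = {
  val out = new StringBuilder
  var i = 0
  while (i < s.length) {
    if (s(i) != '\\' || i == s.length - 1) { out += s(i); i += 1 }
    else s(i + 1) match {
      case 'b' => out += '\b'; i += 2
      case 'f' => out += '\f'; i += 2
      case 'n' => out += '\n'; i += 2
      case 'r' => out += '\r'; i += 2
      case 't' => out += '\t'; i += 2
      case c if c >= '0' && c <= '7' =>
        val digits = s.drop(i + 1).takeWhile(d => d >= '0' && d <= '7').take(3)
        out += Integer.parseInt(digits, 8).toChar
        i += 1 + digits.length
      case 'x' =>
        val digits = s.drop(i + 2).takeWhile(d => Character.digit(d, 16) >= 0).take(2)
        if (digits.isEmpty) { out += 'x'; i += 2 }
        else { out += Integer.parseInt(digits, 16).toChar; i += 2 + digits.length }
      case other => out += other; i += 2
    }
  }
  out.toString
}

assert(unescapeCStyle("""Th\tom\\as\n""") == "Th\tom\\as\n")
{code}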



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926420#comment-16926420
 ] 

Gabor Somogyi commented on SPARK-29027:
---

[~kabhwan] thanks for pinging. I know about this issue because I suggested opening this 
jira on the original PR.
Apart from the Jenkins runs (which are passing), yesterday I started this test in a loop 
with both sbt and maven, and so far it hasn't failed.

What I can think of:
* The environment is significantly different from my Mac and from the PR builder
* The code is not vanilla Spark and contains some downstream changes

All in all, as suggested, an exact environment description + debug logs would help.


> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> I am seeing consistent failures of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary fo

[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926423#comment-16926423
 ] 

Gabor Somogyi commented on SPARK-29027:
---

[~koert] are you guys using vanilla Spark, or does the code contain some downstream 
changes?

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> I am seeing consistent failures of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 24.586 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  5.456 
> s]
> [INFO] Spark 

[jira] [Updated] (SPARK-26598) Fix HiveThriftServer2 set hiveconf and hivevar in every sql

2019-09-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-26598:

Issue Type: Bug  (was: Improvement)

> Fix HiveThriftServer2 set hiveconf and hivevar in every sql
> ---
>
> Key: SPARK-26598
> URL: https://issues.apache.org/jira/browse/SPARK-26598
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wangtao93
>Assignee: dzcxzl
>Priority: Major
> Fix For: 3.0.0
>
>
> [https://github.com/apache/spark/pull/17886] added support for --hiveconf and 
> --hivevar in HiveServer2. However, it applies the hiveconf and hivevar settings for 
> every SQL statement in SparkSQLOperationManager, which is not suitable. This change 
> moves the handling of --hiveconf and --hivevar into SparkSQLSessionManager so that it 
> runs only once, when the HiveServer2 session is opened, instead of re-initializing 
> --hiveconf and --hivevar for every statement.
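A minimal, hedged sketch of the idea (a hypothetical helper, not the actual
SparkSQLSessionManager code): apply the connection's --hiveconf/--hivevar overrides
once when the session is opened, rather than before every statement.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical helper; the "hivevar:" key prefix is an assumption for illustration.
def applyOnceAtSessionOpen(spark: SparkSession,
                           hiveConf: Map[String, String],
                           hiveVar: Map[String, String]): Unit = {
  hiveConf.foreach { case (k, v) => spark.conf.set(k, v) }
  hiveVar.foreach { case (k, v) => spark.conf.set("hivevar:" + k, v) }
}
{code}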



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29031) Materialized column to accelerate queries

2019-09-10 Thread Jason Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-29031:
--
Description: 
Goals
 * Add a new SQL grammar for materialized columns
 * Implicitly rewrite SQL queries on complex-typed columns when a materialized column 
exists for the accessed expression
 * If the data type of the materialized column is atomic, enable vectorized read and 
filter pushdown even though the origin column has a complex type, to improve performance

Example

Create a normal table
{quote}CREATE TABLE x (
name STRING,
age INT,
params STRING,
event MAP
) USING parquet;
{quote}
 
Add materialized columns to an existing table
{quote}ALTER TABLE x ADD COLUMNS (
new_age INT MATERIALIZED age + 1,
city STRING MATERIALIZED get_json_object(params, '$.city'),
label STRING MATERIALIZED event['label']
);
{quote}
 
When a query such as the following is issued
{quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label']
FROM x
WHERE event['label'] = 'newuser';
{quote}
it is rewritten to the equivalent
{quote}SELECT name, new_age, city, label
FROM x
WHERE label = 'newuser';
{quote}
 
Query performance improves dramatically because
 # The rewritten query reads the new atomic columns (for example city, a plain string) 
instead of the whole params string and event map, so much less data needs to be read 
(a hand-written DataFrame-level equivalent of the rewrite is sketched below)
 # Vectorized read can be used in the new query but not in the old one, because 
vectorized read is only enabled when all required columns have atomic types
 # The filter can be pushed down. Only filters on atomic columns can be pushed down; 
the original filter event['label'] = 'newuser' is on a complex column, so it cannot be
 # The rewritten query no longer needs to parse JSON; JSON parsing is a CPU-intensive 
operation that hurts performance dramatically
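The sketch referenced above: a hand-written DataFrame-level equivalent of the rewrite
(assumed input location; an illustration only, not the proposed MATERIALIZED grammar).
The derived atomic columns are computed once, and the filter and projection then run on
them instead of on the JSON string and the map.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
val x = spark.read.parquet("/path/to/x")   // assumed location of table x

val materialized = x
  .withColumn("new_age", col("age") + 1)
  .withColumn("city", get_json_object(col("params"), "$.city"))
  .withColumn("label", col("event").getItem("label"))

materialized
  .filter(col("label") === "newuser")
  .select("name", "new_age", "city", "label")
  .show()
{code}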

 

 

 

 

 

  was:
Goals
 * Add a new SQL grammar of Materialized column
 * Implicitly rewrite SQL queries on the complex type of columns if there is a 
materialized columns for it
 * If the data type of the materialized columns is atomic type, even though the 
origin column type is in complex type, enable vectorized read and filter 
pushdown to improve performance

Example

Create a normal table
{quote}CREATE TABLE x (

name STRING,

age INT,

params STRING,

event MAP

) USING parquet;
{quote}
 

Add materialized columns to an existing table
{quote}ALTER TABLE x ADD COLUMNS (

new_age INT MATERIALIZED age + 1,

city STRING MATERIALIZED get_json_object(params, '$.city'),

label STRING MATERIALIZED event['label']

);
{quote}
 

When issue a query as below
{quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label']

FROM x

WHER event['label'] = 'newuser';
{quote}
It equals to
{quote}SELECT name, new_age, city, label 

FROM x

WHERE label = 'newuser';
{quote}
 

The query performance improved dramatically because
 # The new query (after rewritten) will read the new column city (in string 
type) instead of read the whole map of params(in map string). Much lesser data 
are need to read
 # Vectorized read can be utilized in the new query and can not be used in the 
old one. Because vectorized read can only be enabled when all required columns 
are in atomic type
 # Filter can be pushdown. Only filters on atomic column can be pushdown. The 
original filter  event['label'] = 'newuser' is on complex column, so it can not 
be pushdown.
 # The new query do not need to parse JSON any more. JSON parse is a CPU 
intensive operation which will impact performance dramatically

 

 

 

 

 


> Materialized column to accelerate queries
> -
>
> Key: SPARK-29031
> URL: https://issues.apache.org/jira/browse/SPARK-29031
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Guo
>Priority: Major
>  Labels: SPIP
>
> Goals
>  * Add a new SQL grammar of Materialized column
>  * Implicitly rewrite SQL queries on the complex type of columns if there is 
> a materialized columns for it
>  * If the data type of the materialized columns is atomic type, even though 
> the origin column type is in complex type, enable vectorized read and filter 
> pushdown to improve performance
> Example
> Create a normal table
> {quote}CREATE TABLE x (
> name STRING,
> age INT,
> params STRING,
> event MAP
> ) USING parquet;
> {quote}
>  
> Add materialized columns to an existing table
> {quote}ALTER TABLE x ADD COLUMNS (
> new_age INT MATERIALIZED age + 1,
> city STRING MATERIALIZED get_json_object(params, '$.city'),
> label STRING MATERIALIZED event['label']
> );
> {quote}
>  
> When issue a query as below
> {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label']
> FROM 

[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926427#comment-16926427
 ] 

Gabor Somogyi commented on SPARK-29027:
---

Hmmm, based on the reactor summary you've provided, I see downstream changes.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> I am seeing consistent failures of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 24.586 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  5.456 
> s]
> [INFO] Spark Project Netw

[jira] [Created] (SPARK-29035) unpersist() ignoring cache/persist()

2019-09-10 Thread Jose Silva (Jira)
Jose Silva created SPARK-29035:
--

 Summary: unpersist() ignoring cache/persist()
 Key: SPARK-29035
 URL: https://issues.apache.org/jira/browse/SPARK-29035
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.3
 Environment: Amazon EMR - Spark 2.4.3
Reporter: Jose Silva


Calling unpersist() on the DataFrame, even though it is not used directly anymore, 
removes all the InMemoryTableScan nodes from the DAG.

Here's a simplified version of the code I'm using:

df = spark.read(...).where(...).cache()
df_a = union(df.select(...), df.select(...), df.select(...))
df_b = df.select(...)
df_c = df.select(...)
df_d = df.select(...)
df.unpersist()
join(df_a, df_b, df_c, df_d).write()

I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with and without 
the unpersist() call.

I call unpersist() in order to prevent OOM during the join. From what I understand, even 
though all the DataFrames are derived from df, unpersisting df after the selects 
shouldn't ignore the cache() call, right?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29033) Always use CreateNamedStructUnsafe, the UnsafeRow-based version of the CreateNamedStruct codepath

2019-09-10 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29033:
---
Summary: Always use CreateNamedStructUnsafe, the UnsafeRow-based version of 
the CreateNamedStruct codepath  (was: Always use CreateNamedStructUnsafe 
codepath)

> Always use CreateNamedStructUnsafe, the UnsafeRow-based version of the 
> CreateNamedStruct codepath
> -
>
> Key: SPARK-29033
> URL: https://issues.apache.org/jira/browse/SPARK-29033
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 2.x has two separate implementations of the "create named struct" 
> expression: regular {{CreateNamedStruct}} and {{CreateNamedStructUnsafe}}. 
> The "unsafe" version was added in SPARK-9373 to support structs in 
> {{GenerateUnsafeProjection}}. These two expressions both extend the 
> {{CreateNameStructLike}} trait.
> For Spark 3.0, I propose to always use the "unsafe" code path: this will 
> avoid object allocation / boxing inefficiencies in the "safe" codepath, which 
> is an especially big problem when generating Encoders for deeply-nested case 
> classes.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29015) Can not support "add jar" on JDK 11

2019-09-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29015:

Description: 
How to reproduce:
Case 1:
{code:bash}
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH

build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
sbin/start-thriftserver.sh
bin/beeline -u jdbc:hive2://localhost:1
{code}


{noformat}
0: jdbc:hive2://localhost:1> add jar 
/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
INFO  : Added 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
 to class path
INFO  : Added resources: 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
+-+
| result  |
+-+
+-+
No rows selected (0.381 seconds)
0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
+-+
| Result  |
+-+
+-+
No rows selected (0.613 seconds)
0: jdbc:hive2://localhost:1> select * from addJar;
Error: Error running query: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
(state=,code=0)
{noformat}
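A small hedged Scala check (assumed jar path; an illustration, not a fix) that can be
run in spark-shell on JDK 11 to see whether the class from the added jar is visible to
the current thread's context classloader at all:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("ADD JAR /path/to/hive-hcatalog-core-2.3.6.jar")   // assumed local path

// Try to load the SerDe the same rough way a table scan would end up looking it up.
val visible =
  try {
    Thread.currentThread().getContextClassLoader
      .loadClass("org.apache.hive.hcatalog.data.JsonSerDe")
    true
  } catch { case _: ClassNotFoundException => false }

println(s"JsonSerDe visible after ADD JAR: $visible")
{code}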





  was:
How to reproduce:
Case 1:
{code:bash}
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH

build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
sbin/start-thriftserver.sh
bin/beeline -u jdbc:hive2://localhost:1
{code}


{noformat}
0: jdbc:hive2://localhost:1> add jar 
/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
INFO  : Added 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
 to class path
INFO  : Added resources: 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
+-+
| result  |
+-+
+-+
No rows selected (0.381 seconds)
0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
+-+
| Result  |
+-+
+-+
No rows selected (0.613 seconds)
0: jdbc:hive2://localhost:1> select * from addJar;
Error: Error running query: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
(state=,code=0)
{noformat}

Case 2:

{noformat}
spark-sql> add jar 
/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
ADD JAR 
/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar
spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe';
spark-sql> select * from addJar;
19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar]
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.hive.hcatalog.data.JsonSerDe
at 
org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408)
at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52)
at 
org.apache.spark

[jira] [Commented] (SPARK-29015) Can not support "add jar" on JDK 11

2019-09-10 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926502#comment-16926502
 ] 

Yuming Wang commented on SPARK-29015:
-

Moved {{Case 2}} to SPARK-29022. It's another issue.

> Can not support "add jar" on JDK 11
> ---
>
> Key: SPARK-29015
> URL: https://issues.apache.org/jira/browse/SPARK-29015
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:bash}
> export JAVA_HOME=/usr/lib/jdk-11.0.3
> export PATH=$JAVA_HOME/bin:$PATH
> build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
> export SPARK_PREPEND_CLASSES=true
> sbin/start-thriftserver.sh
> bin/beeline -u jdbc:hive2://localhost:1
> {code}
> {noformat}
> 0: jdbc:hive2://localhost:1> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> INFO  : Added 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
>  to class path
> INFO  : Added resources: 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
> +-+
> | result  |
> +-+
> +-+
> No rows selected (0.381 seconds)
> 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
> SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.613 seconds)
> 0: jdbc:hive2://localhost:1> select * from addJar;
> Error: Error running query: java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
> (state=,code=0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29015) Can not support "add jar" on JDK 11

2019-09-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29015:

Description: 
How to reproduce:
{code:bash}
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH

build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
sbin/start-thriftserver.sh
bin/beeline -u jdbc:hive2://localhost:1
{code}
{noformat}
0: jdbc:hive2://localhost:1> add jar 
/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
INFO  : Added 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
 to class path
INFO  : Added resources: 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
+-+
| result  |
+-+
+-+
No rows selected (0.381 seconds)
0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
+-+
| Result  |
+-+
+-+
No rows selected (0.613 seconds)
0: jdbc:hive2://localhost:1> select * from addJar;
Error: Error running query: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
(state=,code=0)
{noformat}

  was:
How to reproduce:
Case 1:
{code:bash}
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH

build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
sbin/start-thriftserver.sh
bin/beeline -u jdbc:hive2://localhost:1
{code}


{noformat}
0: jdbc:hive2://localhost:1> add jar 
/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
INFO  : Added 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
 to class path
INFO  : Added resources: 
[/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
+-+
| result  |
+-+
+-+
No rows selected (0.381 seconds)
0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
+-+
| Result  |
+-+
+-+
No rows selected (0.613 seconds)
0: jdbc:hive2://localhost:1> select * from addJar;
Error: Error running query: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
(state=,code=0)
{noformat}






> Can not support "add jar" on JDK 11
> ---
>
> Key: SPARK-29015
> URL: https://issues.apache.org/jira/browse/SPARK-29015
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:bash}
> export JAVA_HOME=/usr/lib/jdk-11.0.3
> export PATH=$JAVA_HOME/bin:$PATH
> build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
> export SPARK_PREPEND_CLASSES=true
> sbin/start-thriftserver.sh
> bin/beeline -u jdbc:hive2://localhost:1
> {code}
> {noformat}
> 0: jdbc:hive2://localhost:1> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> INFO  : Added 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
>  to class path
> INFO  : Added resources: 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
> +-+
> | result  |
> +-+
> +-+
> No rows selected (0.381 seconds)
> 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
> SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.613 seconds)
> 0: jdbc:hive2://localhost:1> select * from addJar;
> Error: Error running query: java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
> (state=,code=0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29022) SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class

2019-09-10 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29022:
--
Comment: was deleted

(was: PR [https://github.com/apache/spark/pull/25729])

> SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class
> 
>
> Key: SPARK-29022
> URL: https://issues.apache.org/jira/browse/SPARK-29022
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL CLI can't use class in jars add by SQL 'ADD JAR'
> {code:java}
> spark-sql> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> ADD JAR 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar
> spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe';
> spark-sql> select * from addJar;
> 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar]
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.hive.hcatalog.data.JsonSerDe
>   at 
> org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:367)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:920)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:999)
>   at org.ap

[jira] [Created] (SPARK-29036) SparkThriftServer may can't cancel job after call a cancel before start.

2019-09-10 Thread angerszhu (Jira)
angerszhu created SPARK-29036:
-

 Summary: SparkThriftServer may can't cancel job after call a 
cancel before start.
 Key: SPARK-29036
 URL: https://issues.apache.org/jira/browse/SPARK-29036
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29036) SparkThriftServer may can't cancel job after call a cancel before start.

2019-09-10 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29036:
--
Description: Discussed in [https://github.com/apache/spark/pull/25611]

> SparkThriftServer may can't cancel job after call a cancel before start.
> 
>
> Key: SPARK-29036
> URL: https://issues.apache.org/jira/browse/SPARK-29036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Discussed in [https://github.com/apache/spark/pull/25611]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29009) Returning pojo from udf not working

2019-09-10 Thread Tomasz Belina (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926552#comment-16926552
 ] 

Tomasz Belina commented on SPARK-29009:
---

I've dug a little deeper into the source code and it looks like only Row and simple 
types are supported. I consider this issue a bug because this piece of code:
{code:java}
Dataset<Row> test = spark.createDataFrame(
Arrays.asList(
new Movie("movie1", 2323d, "1212"),
new Movie("movie2", 2323d, "1212"),
new Movie("movie3", 2323d, "1212"),
new Movie("movie4", 2323d, "1212")), 
Movie.class);
{code}
works perfectly well, which means Spark is perfectly able to handle POJOs and convert 
them into Rows in some cases. I was surprised that in the case of a UDF the conversion 
into Row is not applied automatically. Additionally, the documentation for UDFs is not 
very extensive, so it is quite hard to distinguish what is a bug and what is a feature.

A simple check of whether the type of the value returned by a UDF is supported would be 
very helpful.
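A hedged Scala sketch of the workaround that follows from this (my own illustration,
with a hand-built schema standing in for Encoders.bean(SegmentStub.class).schema()):
have the UDF return a Row laid out in the schema's field order, since Rows are
converted while arbitrary POJOs are not.

{code:scala}
import java.sql.Date
import java.time.LocalDate
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Assumed equivalent of the bean schema; bean fields are ordered alphabetically:
// healthPointRatio, id, statusDateTime.
val segmentSchema = StructType(Seq(
  StructField("healthPointRatio", IntegerType, nullable = false),
  StructField("id", IntegerType, nullable = false),
  StructField("statusDateTime", DateType, nullable = true)))

// Return a Row in schema field order instead of the SegmentStub POJO.
val parseResults = udf(
  (s: String, s2: String) => Row(2, 1, Date.valueOf(LocalDate.now())),
  segmentSchema)

val df = Seq(("one", "two"), ("3", "4")).toDF("foe1", "foe2")
df.select(parseResults(col("foe1"), col("foe2"))).show(false)
{code}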

 

> Returning pojo from udf not working
> ---
>
> Key: SPARK-29009
> URL: https://issues.apache.org/jira/browse/SPARK-29009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Tomasz Belina
>Priority: Major
>
>  It looks like Spark is unable to construct a Row from a POJO returned from a UDF.
> Given this POJO:
> {code:java}
> public class SegmentStub {
> private int id;
> private Date statusDateTime;
> private int healthPointRatio;
> }
> {code}
> Registration of the UDF:
> {code:java}
> public class ParseResultsUdf {
> public String registerUdf(SparkSession sparkSession) {
> Encoder encoder = Encoders.bean(SegmentStub.class);
> final StructType schema = encoder.schema();
> sparkSession.udf().register(UDF_NAME,
> (UDF2) (s, s2) -> new 
> SegmentStub(1, Date.valueOf(LocalDate.now()), 2),
> schema
> );
> return UDF_NAME;
> }
> }
> {code}
> Test code:
> {code:java}
> List strings = Arrays.asList(new String[]{"one", "two"},new 
> String[]{"3", "4"});
> JavaRDD rowJavaRDD = 
> sparkContext.parallelize(strings).map(RowFactory::create);
> StructType schema = DataTypes
> .createStructType(new StructField[] { 
> DataTypes.createStructField("foe1", DataTypes.StringType, false),
> DataTypes.createStructField("foe2", 
> DataTypes.StringType, false) });
> Dataset dataFrame = 
> sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema);
> Seq columnSeq = new Set.Set2<>(col("foe1"), 
> col("foe2")).toSeq();
> dataFrame.select(callUDF(udfName, columnSeq)).show();
> {code}
>  throws exception: 
> {code:java}
> Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, 
> statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) 
> cannot be converted to struct
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
>   ... 21 more
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29009) Returning pojo from udf not working

2019-09-10 Thread Tomasz Belina (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926554#comment-16926554
 ] 

Tomasz Belina commented on SPARK-29009:
---

POJO is fine - I've just pasted only part of the class and it works perfectly 
well in the case of {{createDataFrame}}. BTW - automatic conversion from POJO to 
row is only partly supported in the case of {{createDataFrame}}; I've discovered 
this bug: SPARK-25654.

> Returning pojo from udf not working
> ---
>
> Key: SPARK-29009
> URL: https://issues.apache.org/jira/browse/SPARK-29009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Tomasz Belina
>Priority: Major
>
>  It looks like spark is unable to construct row from pojo returned from udf.
> Give POJO:
> {code:java}
> public class SegmentStub {
> private int id;
> private Date statusDateTime;
> private int healthPointRatio;
> }
> {code}
> Registration of the UDF:
> {code:java}
> public class ParseResultsUdf {
> public String registerUdf(SparkSession sparkSession) {
> Encoder encoder = Encoders.bean(SegmentStub.class);
> final StructType schema = encoder.schema();
> sparkSession.udf().register(UDF_NAME,
> (UDF2) (s, s2) -> new 
> SegmentStub(1, Date.valueOf(LocalDate.now()), 2),
> schema
> );
> return UDF_NAME;
> }
> }
> {code}
> Test code:
> {code:java}
> List strings = Arrays.asList(new String[]{"one", "two"},new 
> String[]{"3", "4"});
> JavaRDD rowJavaRDD = 
> sparkContext.parallelize(strings).map(RowFactory::create);
> StructType schema = DataTypes
> .createStructType(new StructField[] { 
> DataTypes.createStructField("foe1", DataTypes.StringType, false),
> DataTypes.createStructField("foe2", 
> DataTypes.StringType, false) });
> Dataset dataFrame = 
> sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema);
> Seq columnSeq = new Set.Set2<>(col("foe1"), 
> col("foe2")).toSeq();
> dataFrame.select(callUDF(udfName, columnSeq)).show();
> {code}
>  throws exception: 
> {code:java}
> Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, 
> statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) 
> cannot be converted to struct
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
>   ... 21 more
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29009) Returning pojo from udf not working

2019-09-10 Thread Tomasz Belina (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926554#comment-16926554
 ] 

Tomasz Belina edited comment on SPARK-29009 at 9/10/19 12:03 PM:
-

POJO is fine - I've just pasted only part of the class and it works perfectly 
well in the case of {{createDataFrame}}. BTW - automatic conversion from POJO to 
row is only partly supported in the case of {{createDataFrame}}; I've discovered 
this bug: SPARK-25654.


was (Author: tomasz.belina):
POJO is fine - I've just paste only part of the class and it works perfectly 
well in case of {{createDataFrame. BTW - automatic conversion from PJO to row 
is only partly supported in case of }}_{{createDataFrame}}{{.}}_ {{I've 
discovered this bug: }}{{SPARK-25654.}}

> Returning pojo from udf not working
> ---
>
> Key: SPARK-29009
> URL: https://issues.apache.org/jira/browse/SPARK-29009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Tomasz Belina
>Priority: Major
>
>  It looks like spark is unable to construct row from pojo returned from udf.
> Give POJO:
> {code:java}
> public class SegmentStub {
> private int id;
> private Date statusDateTime;
> private int healthPointRatio;
> }
> {code}
> Registration of the UDF:
> {code:java}
> public class ParseResultsUdf {
> public String registerUdf(SparkSession sparkSession) {
> Encoder encoder = Encoders.bean(SegmentStub.class);
> final StructType schema = encoder.schema();
> sparkSession.udf().register(UDF_NAME,
> (UDF2) (s, s2) -> new 
> SegmentStub(1, Date.valueOf(LocalDate.now()), 2),
> schema
> );
> return UDF_NAME;
> }
> }
> {code}
> Test code:
> {code:java}
> List strings = Arrays.asList(new String[]{"one", "two"},new 
> String[]{"3", "4"});
> JavaRDD rowJavaRDD = 
> sparkContext.parallelize(strings).map(RowFactory::create);
> StructType schema = DataTypes
> .createStructType(new StructField[] { 
> DataTypes.createStructField("foe1", DataTypes.StringType, false),
> DataTypes.createStructField("foe2", 
> DataTypes.StringType, false) });
> Dataset dataFrame = 
> sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema);
> Seq columnSeq = new Set.Set2<>(col("foe1"), 
> col("foe2")).toSeq();
> dataFrame.select(callUDF(udfName, columnSeq)).show();
> {code}
>  throws exception: 
> {code:java}
> Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, 
> statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) 
> cannot be converted to struct
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
>   ... 21 more
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926563#comment-16926563
 ] 

koert kuipers commented on SPARK-29027:
---

hey, the command i run is:
mvn clean test -fae

i am not aware of downstream changes. where/how do you see that in the reactor 
summary?
as far as i know this is spark master. to be sure i will do a new clone of the repo.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch ... S

[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926577#comment-16926577
 ] 

koert kuipers commented on SPARK-29027:
---

i am running the tests on my work laptop. it has a kerberos client installed (e.g. i 
can kinit, klist, kdestroy on it). i get the same error on another laptop (ubuntu 
18) and on one of our build servers. they also have kerberos clients installed.

i tried temporarily renaming /etc/krb5.conf to something else and then the 
tests seemed to pass. so now i suspect that a functioning kerberos client 
configuration interferes with the test. i will repeat this to confirm it is not a 
coincidence.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-S

[jira] [Created] (SPARK-29037) [Core] Spark may duplicate results when an application aborted and rerun

2019-09-10 Thread feiwang (Jira)
feiwang created SPARK-29037:
---

 Summary: [Core] Spark may duplicate results when an application 
aborted and rerun
 Key: SPARK-29037
 URL: https://issues.apache.org/jira/browse/SPARK-29037
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: feiwang






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29009) Returning pojo from udf not working

2019-09-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926592#comment-16926592
 ] 

Hyukjin Kwon commented on SPARK-29009:
--

Can you copy and paste a minimised version of the class to prevent such 
confusion?

> Returning pojo from udf not working
> ---
>
> Key: SPARK-29009
> URL: https://issues.apache.org/jira/browse/SPARK-29009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Tomasz Belina
>Priority: Major
>
>  It looks like spark is unable to construct row from pojo returned from udf.
> Give POJO:
> {code:java}
> public class SegmentStub {
> private int id;
> private Date statusDateTime;
> private int healthPointRatio;
> }
> {code}
> Registration of the UDF:
> {code:java}
> public class ParseResultsUdf {
> public String registerUdf(SparkSession sparkSession) {
> Encoder encoder = Encoders.bean(SegmentStub.class);
> final StructType schema = encoder.schema();
> sparkSession.udf().register(UDF_NAME,
> (UDF2) (s, s2) -> new 
> SegmentStub(1, Date.valueOf(LocalDate.now()), 2),
> schema
> );
> return UDF_NAME;
> }
> }
> {code}
> Test code:
> {code:java}
> List strings = Arrays.asList(new String[]{"one", "two"},new 
> String[]{"3", "4"});
> JavaRDD rowJavaRDD = 
> sparkContext.parallelize(strings).map(RowFactory::create);
> StructType schema = DataTypes
> .createStructType(new StructField[] { 
> DataTypes.createStructField("foe1", DataTypes.StringType, false),
> DataTypes.createStructField("foe2", 
> DataTypes.StringType, false) });
> Dataset dataFrame = 
> sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema);
> Seq columnSeq = new Set.Set2<>(col("foe1"), 
> col("foe2")).toSeq();
> dataFrame.select(callUDF(udfName, columnSeq)).show();
> {code}
>  throws exception: 
> {code:java}
> Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, 
> statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) 
> cannot be converted to struct
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
>   ... 21 more
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29037:

Summary: [Core] Spark gives duplicate result when an application was killed 
and rerun  (was: [Core] Spark gives duplicate result when an application 
aborted and rerun)

> [Core] Spark gives duplicate result when an application was killed and rerun
> 
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: feiwang
>Priority: Major
>
> Case:
> A Spark application was killed because it was long-running.
> When we re-ran this application, we found that Spark gave duplicated results.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29037) [Core] Spark may duplicate results when an application aborted and rerun

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29037:

Description: 
Case:

A Spark application was killed because it was long-running.
When we re-ran this application, we found that Spark gave duplicated results.

> [Core] Spark may duplicate results when an application aborted and rerun
> 
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: feiwang
>Priority: Major
>
> Case:
> A Spark application was killed because it was long-running.
> When we re-ran this application, we found that Spark gave duplicated results.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application aborted and rerun

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29037:

Summary: [Core] Spark gives duplicate result when an application aborted 
and rerun  (was: [Core] Spark may duplicate results when an application aborted 
and rerun)

> [Core] Spark gives duplicate result when an application aborted and rerun
> -
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: feiwang
>Priority: Major
>
> Case:
> A Spark application was killed because it was long-running.
> When we re-ran this application, we found that Spark gave duplicated results.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926577#comment-16926577
 ] 

koert kuipers edited comment on SPARK-29027 at 9/10/19 1:02 PM:


i am running the tests on my work laptop. it has a kerberos client installed (e.g. i 
can kinit, klist, kdestroy on it). i get the same error on another laptop (ubuntu 
18) and on one of our build servers. they also have kerberos clients installed.



was (Author: koert):
i am running test on my work laptop. it has kerberos client installed (e.g. i 
can kinit, klist, kdestroy on it). i get the same error on other laptop (ubuntu 
18) and one of our build servers. they also have kerberos client installed. 

i tried temporarily renaming /etc/krb5.conf to something else and then the 
tests passed it seems. so now i suspect that a functioning kerberos client 
interferes with test. i will repeat the confirm this is not coincidence.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
>

[jira] [Created] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-29038:
--

 Summary: SPIP: Support Spark Materialized View
 Key: SPARK-29038
 URL: https://issues.apache.org/jira/browse/SPARK-29038
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin


A materialized view is an important approach in DBMSs for caching data to accelerate 
queries. By creating a materialized view through SQL, the data that can be 
cached is very flexible and can be configured arbitrarily according to 
specific usage scenarios. The Materialization Manager automatically updates the 
cached data according to changes in the detail source tables, simplifying the 
user's work. When a user submits a query, the Spark optimizer rewrites the 
execution plan based on the available materialized views to determine the 
optimal execution plan.

Details in [design 
doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]





--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926620#comment-16926620
 ] 

koert kuipers commented on SPARK-29027:
---

i am going to try running the tests on a virtual machine to try to isolate what the 
issue in my environment could be

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 24.586 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  5.456

[jira] [Resolved] (SPARK-28856) DataSourceV2: Support SHOW DATABASES

2019-09-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28856.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25601
[https://github.com/apache/spark/pull/25601]

> DataSourceV2: Support SHOW DATABASES
> 
>
> Key: SPARK-28856
> URL: https://issues.apache.org/jira/browse/SPARK-28856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> SHOW DATABASES needs to support v2 catalogs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28856) DataSourceV2: Support SHOW DATABASES

2019-09-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28856:
---

Assignee: Terry Kim

> DataSourceV2: Support SHOW DATABASES
> 
>
> Key: SPARK-28856
> URL: https://issues.apache.org/jira/browse/SPARK-28856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> SHOW DATABASES needs to support v2 catalogs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926648#comment-16926648
 ] 

Gabor Somogyi commented on SPARK-29027:
---

{quote}where/how do you see that in reactor summary?{quote}
I thought I had seen an additional project in the summary, but I revisited it and 
that's not the case.

I've double-checked my Mac and there I also have a kerberos client installed.


> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch 

[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926420#comment-16926420
 ] 

Gabor Somogyi edited comment on SPARK-29027 at 9/10/19 1:38 PM:


[~kabhwan] thanks for pinging. I know about this because I suggested opening this 
jira on the original PR.
Apart from the jenkins runs (which are passing), yesterday I started this test in 
a loop with sbt and maven as well, but so far it hasn't failed.

What I can think of:
* The environment is significantly different from my Mac and from the PR builder
* The code is not vanilla Spark and has some downstream changes

All in all, as suggested, an exact environment description + debug logs would help.



was (Author: gsomogyi):
[~kabhwan] thanks for pinging. I know of this because I've suggested on the 
original PR to open this jira.
Apart from jenkins runs (which are passing) yesterday I've started this test in 
a loop with sbt and maven as well but until now haven't failed.

What I can think of:
* The environment is significantly different from my MAC and from PR builder
* The code is not vanilla Spark and has some downstream changes

All in all as suggested exact environment description + debug logs would help.


> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.s

[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926650#comment-16926650
 ] 

Marco Gaido commented on SPARK-29038:
-

[~cltlfcjin] currently spark has something similar, which is query caching, 
where the user can also select the level of caching performed. My 
understanding is that your proposal is to do something very similar, just with 
a different, more DB-oriented syntax. Is my understanding correct?
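
For reference, a minimal sketch of that existing query-caching facility, using only 
standard Spark APIs (the {{employees}} table and its columns are hypothetical):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class QueryCachingSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("query-caching").getOrCreate();

        // Cache the result of a query at a user-chosen storage level.
        Dataset<Row> deptCounts =
                spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept");
        deptCounts.persist(StorageLevel.MEMORY_AND_DISK());
        deptCounts.count();  // the first action materializes the cache

        // SQL-level equivalent: cache a query result under a name.
        spark.sql("CACHE TABLE dept_counts AS "
                + "SELECT dept, count(*) AS cnt FROM employees GROUP BY dept");

        spark.stop();
    }
}
{code}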

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> A materialized view is an important approach in DBMSs for caching data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible and can be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cached data according to changes in the detail 
> source tables, simplifying the user's work. When a user submits a query, the 
> Spark optimizer rewrites the execution plan based on the available 
> materialized views to determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926650#comment-16926650
 ] 

Marco Gaido edited comment on SPARK-29038 at 9/10/19 1:40 PM:
--

[~cltlfcjin] currently spark has something similar, which is query caching, 
where the user can also select the level of caching performed. My understanding 
is that your proposal is to do something very similar, just with a different, 
more DB-oriented syntax. Is my understanding correct?


was (Author: mgaido):
[~cltlfcjin] currently spark has a something similar, which is query caching, 
where the user can also select the level of caching performed. My 
undersatanding is that your proposal is to do something very similar, just with 
a different syntax, more DB oriented. Is my understanding correct?

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> A materialized view is an important approach in DBMSs for caching data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible and can be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cached data according to changes in the detail 
> source tables, simplifying the user's work. When a user submits a query, the 
> Spark optimizer rewrites the execution plan based on the available 
> materialized views to determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926658#comment-16926658
 ] 

angerszhu commented on SPARK-29038:
---

I am building a similar framework. It can trigger caching of a SQL query's 
sub-query data when some condition is satisfied, and when a new SQL query comes 
in, it can check the LogicalPlan; if there is a matching part, it rewrites the 
LogicalPlan to use the cached data.

Right now it supports caching data in memory and in Alluxio.
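
As a point of comparison, Spark's built-in cache already does a form of this at the 
plan level: once a query is cached, a later query that contains the same sub-plan is 
rewritten by the {{CacheManager}} to read from the cached {{InMemoryRelation}} instead 
of recomputing it. A small sketch of that existing behaviour (the {{employees}} table 
and its columns are hypothetical):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CachedSubPlanReuseSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("cached-subplan-reuse").getOrCreate();

        // Cache an aggregation once and materialize it.
        Dataset<Row> deptCounts =
                spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept");
        deptCounts.cache();
        deptCounts.count();

        // A later query containing the same sub-plan is rewritten to scan the cached
        // data; the physical plan printed by explain() should show InMemoryTableScan.
        spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept")
             .filter("cnt > 10")
             .explain(true);

        spark.stop();
    }
}
{code}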

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> A materialized view is an important approach in DBMSs for caching data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible and can be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cached data according to changes in the detail 
> source tables, simplifying the user's work. When a user submits a query, the 
> Spark optimizer rewrites the execution plan based on the available 
> materialized views to determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29037:

Description: 
For a stage whose tasks commit output, each task first saves its output to a 
staging dir; when all tasks of this stage succeed, all task output under the 
staging dir is moved to the destination dir.

However, when we kill an application while it is committing tasks' output, part 
of the tasks' results is left in the staging dir and is not cleared gracefully.

Then we rerun this application and the new application reuses this staging dir.

When the task commit stage of the new application succeeds, all task output 
under this staging dir, which contains part of the old application's task 
output, is moved to the destination dir, so the result is duplicated.

In a more general case, I also think it is confusing that several applications 
running against the same root path simultaneously will share the same staging 
dir for the same jobId.

  was:
Case:

A spark application  was be killed due to long-running.
Then we re-run this application, we find that spark gives duplicated result.
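
As a user-side stop-gap until the staging-dir reuse itself is addressed, a defensive 
cleanup sketch is to remove any leftover staging data before the rerun. This assumes 
the default Hadoop {{FileOutputCommitter}}, which stages task output under 
{{_temporary}} beneath the destination dir; the output path below is hypothetical:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanLeftoverStaging {
    public static void main(String[] args) throws Exception {
        // Destination dir of the job that was killed mid-commit (hypothetical path).
        Path outputDir = new Path("hdfs:///warehouse/mydb/mytable");
        // Default FileOutputCommitter staging location under the destination dir.
        Path stagingDir = new Path(outputDir, "_temporary");

        FileSystem fs = outputDir.getFileSystem(new Configuration());
        if (fs.exists(stagingDir)) {
            // Drop leftovers from the killed run so the rerun cannot move them again.
            fs.delete(stagingDir, true);
        }
    }
}
{code}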


> [Core] Spark gives duplicate result when an application was killed and rerun
> 
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: feiwang
>Priority: Major
>
> For a stage whose tasks commit output, each task first saves its output to a 
> staging dir; when all tasks of this stage succeed, all task output under the 
> staging dir is moved to the destination dir.
> However, when we kill an application while it is committing tasks' output, 
> part of the tasks' results is left in the staging dir and is not cleared 
> gracefully.
> Then we rerun this application and the new application reuses this staging 
> dir.
> When the task commit stage of the new application succeeds, all task output 
> under this staging dir, which contains part of the old application's task 
> output, is moved to the destination dir, so the result is duplicated.
> In a more general case, I also think it is confusing that several 
> applications running against the same root path simultaneously will share the 
> same staging dir for the same jobId.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29037:

Affects Version/s: (was: 2.3.1)
   2.1.0

> [Core] Spark gives duplicate result when an application was killed and rerun
> 
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: feiwang
>Priority: Major
>
> For a stage whose tasks commit output, each task first saves its output to a 
> staging dir; when all tasks of this stage succeed, all task output under the 
> staging dir is moved to the destination dir.
> However, when we kill an application while it is committing tasks' output, 
> part of the tasks' results is left in the staging dir and is not cleared 
> gracefully.
> Then we rerun this application and the new application reuses this staging 
> dir.
> When the task commit stage of the new application succeeds, all task output 
> under this staging dir, which contains part of the old application's task 
> output, is moved to the destination dir, so the result is duplicated.
> In a more general case, I also think it is confusing that several 
> applications running against the same root path simultaneously will share the 
> same staging dir for the same jobId.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926707#comment-16926707
 ] 

koert kuipers commented on SPARK-29027:
---

I tried running the tests in a virtual machine and they pass, so it's 
something in my environment (or should you say, in all our corporate laptops 
and servers), but I have no idea what it could be right now.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 24.586

[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926707#comment-16926707
 ] 

koert kuipers edited comment on SPARK-29027 at 9/10/19 2:53 PM:


I tried running the tests in a virtual machine and they pass, so it's 
something in my environment (or really in all our corporate laptops and 
servers), but I have no idea what it could be right now.


was (Author: koert):
I tried running the tests in a virtual machine and they pass, so it's 
something in my environment (or should you say, in all our corporate laptops 
and servers), but I have no idea what it could be right now.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSH

[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29037:

Description: 
For a stage whose tasks commit output, each task first saves its output to a 
staging dir; when all tasks of this stage succeed, all task output under the 
staging dir is moved to the destination dir.

However, when we kill an application while it is committing tasks' output, 
parts of the tasks' results are left in the staging dir and are not cleaned up 
gracefully.

Then we rerun this application and the new application reuses this staging 
dir.

When the task commit stage of the new application succeeds, all task output 
under this staging dir, which contains parts of the old application's task 
output, is moved to the destination dir, so the result is duplicated.



  was:
For a stage whose tasks commit output, each task first saves its output to a 
staging dir; when all tasks of this stage succeed, all task output under the 
staging dir is moved to the destination dir.

However, when we kill an application while it is committing tasks' output, 
parts of the tasks' results are left in the staging dir and are not cleaned up 
gracefully.

Then we rerun this application and the new application reuses this staging 
dir.

When the task commit stage of the new application succeeds, all task output 
under this staging dir, which contains parts of the old application's task 
output, is moved to the destination dir, so the result is duplicated.

A more common case: it is confusing that several applications running against 
the same root path simultaneously share the same staging dir for the same jobId.


> [Core] Spark gives duplicate result when an application was killed and rerun
> 
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: feiwang
>Priority: Major
>
> For a stage whose tasks commit output, each task first saves its output to a 
> staging dir; when all tasks of this stage succeed, all task output under the 
> staging dir is moved to the destination dir.
> However, when we kill an application while it is committing tasks' output, 
> parts of the tasks' results are left in the staging dir and are not cleaned 
> up gracefully.
> Then we rerun this application and the new application reuses this staging 
> dir.
> When the task commit stage of the new application succeeds, all task output 
> under this staging dir, which contains parts of the old application's task 
> output, is moved to the destination dir, so the result is duplicated.
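
A minimal, self-contained Scala sketch of the collision described above. The 
staging-path layout here is an assumption for illustration only, not Spark's 
actual commit-protocol code; it only shows why two runs that share an output 
path and a jobId end up sharing a staging dir:

{code:scala}
import java.nio.file.Paths

object StagingDirCollision {
  // Assumption: the staging dir is derived only from the output path and the
  // jobId, with nothing unique to the application attempt mixed in.
  def stagingDir(outputPath: String, jobId: Int): String =
    Paths.get(outputPath, s".spark-staging-$jobId").toString

  def main(args: Array[String]): Unit = {
    val killedRun = stagingDir("/warehouse/table_x", 0) // left partial task files behind
    val rerun     = stagingDir("/warehouse/table_x", 0) // resolves to the same directory
    // Because both runs share the directory, the rerun's commit sweeps the
    // killed run's leftover files into the destination dir as well.
    println(killedRun == rerun) // prints: true
  }
}
{code}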



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29028) Add links to IBM Cloud Object Storage connector in cloud-integration.md

2019-09-10 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29028:
-

Assignee: Dilip Biswal

> Add links to IBM Cloud Object Storage connector in cloud-integration.md
> ---
>
> Key: SPARK-29028
> URL: https://issues.apache.org/jira/browse/SPARK-29028
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29028) Add links to IBM Cloud Object Storage connector in cloud-integration.md

2019-09-10 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29028.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25737
[https://github.com/apache/spark/pull/25737]

> Add links to IBM Cloud Object Storage connector in cloud-integration.md
> ---
>
> Key: SPARK-29028
> URL: https://issues.apache.org/jira/browse/SPARK-29028
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29039) centralize the catalog and table lookup logic

2019-09-10 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29039:
---

 Summary: centralize the catalog and table lookup logic
 Key: SPARK-29039
 URL: https://issues.apache.org/jira/browse/SPARK-29039
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28982) Support ThriftServer GetTypeInfoOperation for Spark's own type

2019-09-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-28982.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25694
[https://github.com/apache/spark/pull/25694]

> Support ThriftServer GetTypeInfoOperation for Spark's own type
> --
>
> Key: SPARK-28982
> URL: https://issues.apache.org/jira/browse/SPARK-28982
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently {{!typeinfo}} returns INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME, 
> ARRAY, MAP, STRUCT, UNIONTYPE and USER_DEFINED, all of which Spark turns into 
> strings.
> Maybe we should add a SparkGetTypeInfoOperation to exclude the types that we 
> don't support?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28982) Support ThriftServer GetTypeInfoOperation for Spark's own type

2019-09-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-28982:
---

Assignee: angerszhu

> Support ThriftServer GetTypeInfoOperation for Spark's own type
> --
>
> Key: SPARK-28982
> URL: https://issues.apache.org/jira/browse/SPARK-28982
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Currently {{!typeinfo}} returns INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME, 
> ARRAY, MAP, STRUCT, UNIONTYPE and USER_DEFINED, all of which Spark turns into 
> strings.
> Maybe we should add a SparkGetTypeInfoOperation to exclude the types that we 
> don't support?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926809#comment-16926809
 ] 

koert kuipers commented on SPARK-29027:
---

[~gsomogyi] do you use any services that require open ports, perhaps? I am 
thinking it could be a firewall issue, or a host-to-IP mapping issue?

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> i am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 24.586 
> s]
> [INFO] Spark Project Local DB ..

[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-10 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926814#comment-16926814
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Hi [~JerryHouse], do you use any non-deterministic operations when preparing 
your training dataset, such as sampling or filtering based on random numbers?

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happens sometimes. We also found that the AUC metric was not 
> stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration: AUC varied from 0.60 to 
> 0.67, which is not stable enough for a production environment.
> Dataset size: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class 
> ALSData(user:Int, item:Int, rating:Float) extends 

[jira] [Created] (SPARK-29040) Support pyspark.createDataFrame from a pyarrow.Table

2019-09-10 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-29040:


 Summary: Support pyspark.createDataFrame from a pyarrow.Table
 Key: SPARK-29040
 URL: https://issues.apache.org/jira/browse/SPARK-29040
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Bryan Cutler


PySpark {{createDataFrame}} currently supports creating a Spark DataFrame from 
pandas, using Arrow if enabled. This could be extended to accept a 
{{pyarrow.Table}}, which has the added benefit of being able to efficiently use 
columns with nested struct types.

It is possible to convert a pyarrow.Table with nested columns into a 
pandas.DataFrame, but the nested data becomes Python dictionaries, which is not 
a performant way to parallelize the data.

Time/Date columns would need to be handled specially, since pyspark currently 
uses pandas to convert Arrow data of these types to the required Spark internal 
format.

This follows from a mailing list discussion at 
http://apache-spark-user-list.1001560.n3.nabble.com/question-about-pyarrow-Table-to-pyspark-DataFrame-conversion-td36110.html



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29029) PhysicalOperation.collectProjectsAndFilters should use AttributeMap while substituting aliases

2019-09-10 Thread Nikita Konda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikita Konda updated SPARK-29029:
-
Component/s: SQL

> PhysicalOperation.collectProjectsAndFilters should use AttributeMap while 
> substituting aliases
> --
>
> Key: SPARK-29029
> URL: https://issues.apache.org/jira/browse/SPARK-29029
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.0
>Reporter: Nikita Konda
>Priority: Major
>
> We have a specific use case wherein we try to insert a custom logical 
> operator into our logical plan to bypass some of Spark's optimization rules. 
> However, we remove this logical operator as part of a custom optimization 
> rule before the plan is sent to SparkStrategies.
> We are hitting an issue in the following scenario:
> Analyzed plan:
> {code:java}
> [1] Project [userid#0]
> +- [2] SubqueryAlias tmp6
>+- [3] Project [videoid#47L, avebitrate#2, userid#0]
>   +- [4] Filter NOT (videoid#47L = cast(30 as bigint))
>  +- [5] SubqueryAlias tmp5
> +- [6] CustomBarrier
>+- [7] Project [videoid#47L, avebitrate#2, userid#0]
>   +- [8] Filter (avebitrate#2 < 10)
>  +- [9] SubqueryAlias tmp3
> +- [10] Project [avebitrate#2, factorial(videoid#1) 
> AS videoid#47L, userid#0]
>+- [11] SubqueryAlias tmp2
>   +- [12] Project [userid#0, videoid#1, 
> avebitrate#2]
>  +- [13] SubqueryAlias tmp1
> +- [14] Project [userid#0, videoid#1, 
> avebitrate#2]
>+- [15] SubqueryAlias views
>   +- [16] 
> Relation[userid#0,videoid#1,avebitrate#2] 
> {code}
>  
> Optimized Plan:
> {code:java}
> [1] Project [userid#0]
> +- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30))
>+- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0]
>   +- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10))
>  +- [5] Relation[userid#0,videoid#1,avebitrate#2]
> {code}
>  
>  When this plan is passed into *PhysicalOperation* in *DataSourceStrategy*, 
> the collectProjectsAndFilters method collects the filters as 
> List(AttributeReference(videoid#47L), AttributeReference(avebitrate#2)). 
> However, at this stage the base relation only has videoid#1, and hence it 
> throws an exception saying *key not found: videoid#47L*.
> On looking further, we noticed that the alias map in 
> *PhysicalOperation.substitute* does have an entry for the key *videoid#47L*: 
> Map((videoid#47L, factorial(videoid#1))). However, substitute does not 
> replace the alias videoid#47L with its expression, because the two 
> attributes differ in the qualifier parameter.
> Attribute key in the alias map: AttributeReference("videoid", LongType, 
> nullable = true)(ExprId(47, _), *"None"*)
> Attribute in the filter condition: AttributeReference("videoid", LongType, 
> nullable = true)(ExprId(47, _), *"Some(tmp5)"*)
> Both differ only in the qualifier; if we use AttributeMap instead of 
> Map[Attribute, Expression] for the alias map, we can get rid of the above 
> issue.
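
A self-contained Scala sketch of the proposed fix. The case class below is a 
simplified stand-in for Catalyst's AttributeReference, not the real API; it 
only illustrates why keying the alias map by exprId alone, as AttributeMap 
does, makes the qualifier mismatch irrelevant:

{code:scala}
// Simplified stand-in for Catalyst's AttributeReference (illustration only).
case class Attr(name: String, exprId: Long, qualifier: Option[String])

object AliasLookup {
  def main(args: Array[String]): Unit = {
    // The alias map is keyed by the unqualified attribute ...
    val aliasKey   = Attr("videoid", 47, None)
    // ... while the filter condition carries the qualified copy.
    val filterAttr = Attr("videoid", 47, Some("tmp5"))
    val aliases    = Map(aliasKey -> "factorial(videoid#1)")

    // A plain Map lookup uses full equality, so the qualifier mismatch makes
    // the substitution fail even though both refer to the same expression id.
    println(aliases.get(filterAttr)) // None

    // An AttributeMap-style lookup keys only on the exprId, so it matches.
    val byExprId = aliases.map { case (k, v) => k.exprId -> v }
    println(byExprId.get(filterAttr.exprId)) // Some(factorial(videoid#1))
  }
}
{code}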



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses

2019-09-10 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926889#comment-16926889
 ] 

Ryan Blue commented on SPARK-29014:
---

[~cloud_fan], why does this require a major refactor?

It would be best to keep the implementation of this as small as possible and 
not tie it to other work.

> DataSourceV2: Clean up current, default, and session catalog uses
> -
>
> Key: SPARK-29014
> URL: https://issues.apache.org/jira/browse/SPARK-29014
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Blocker
>
> Catalog tracking in DSv2 has evolved since the initial changes went in. We 
> need to make sure that handling is consistent across plans using the latest 
> rules:
>  * The _current_ catalog should be used when no catalog is specified
>  * The _default_ catalog is the catalog _current_ is initialized to
>  * If the _default_ catalog is not set, then it is the built-in Spark session 
> catalog, which will be called `spark_catalog` (This is the v2 session catalog)
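
A toy Scala sketch of the lookup rules listed above, under stated assumptions: 
only `spark_catalog` is taken from the issue, and the helper names and the 
example catalog name are made up for illustration:

{code:scala}
object CatalogResolution {
  val builtInSessionCatalog = "spark_catalog"

  // The current catalog is initialized to the default catalog; if no default
  // is configured, it falls back to the built-in session catalog.
  def initialCurrentCatalog(defaultCatalog: Option[String]): String =
    defaultCatalog.getOrElse(builtInSessionCatalog)

  // When a statement does not name a catalog, the current catalog is used.
  def resolve(explicitCatalog: Option[String], currentCatalog: String): String =
    explicitCatalog.getOrElse(currentCatalog)

  def main(args: Array[String]): Unit = {
    val current = initialCurrentCatalog(defaultCatalog = None)
    println(resolve(None, current))             // spark_catalog
    println(resolve(Some("prod_cat"), current)) // prod_cat
  }
}
{code}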



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python

2019-09-10 Thread Junichi Koizumi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927074#comment-16927074
 ] 

Junichi  Koizumi  commented on SPARK-28902:
---

  Since versions aren't the main concern here, should I create a PR?

> Spark ML Pipeline with nested Pipelines fails to load when saved from Python
> 
>
> Key: SPARK-28902
> URL: https://issues.apache.org/jira/browse/SPARK-28902
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Saif Addin
>Priority: Minor
>
> Hi, this error is affecting a bunch of our nested use cases.
> A *PipelineModel* that has another *PipelineModel* as one of its stages 
> fails to load from Scala when it was saved from Python.
> *Python side:*
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import Tokenizer
> t = Tokenizer()
> p = Pipeline().setStages([t])
> d = spark.createDataFrame([["Hello Peter Parker"]])
> pm = p.fit(d)
> np = Pipeline().setStages([pm])
> npm = np.fit(d)
> npm.write().save('./npm_test')
> {code}
>  
>  
> *Scala side:*
>  
> {code:java}
> scala> import org.apache.spark.ml.PipelineModel
> scala> val pp = PipelineModel.load("./npm_test")
> java.lang.IllegalArgumentException: requirement failed: Error loading 
> metadata: Expected class name org.apache.spark.ml.PipelineModel but found 
> class name pyspark.ml.pipeline.PipelineModel
>  at scala.Predef$.require(Predef.scala:224)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
>  at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
>  ... 50 elided
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python

2019-09-10 Thread Junichi Koizumi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junichi  Koizumi  updated SPARK-28902:
--
Comment: was deleted

(was:   Since, versions aren't the main concern here should I create a PR ? )

> Spark ML Pipeline with nested Pipelines fails to load when saved from Python
> 
>
> Key: SPARK-28902
> URL: https://issues.apache.org/jira/browse/SPARK-28902
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Saif Addin
>Priority: Minor
>
> Hi, this error is affecting a bunch of our nested use cases.
> A *PipelineModel* that has another *PipelineModel* as one of its stages 
> fails to load from Scala when it was saved from Python.
> *Python side:*
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import Tokenizer
> t = Tokenizer()
> p = Pipeline().setStages([t])
> d = spark.createDataFrame([["Hello Peter Parker"]])
> pm = p.fit(d)
> np = Pipeline().setStages([pm])
> npm = np.fit(d)
> npm.write().save('./npm_test')
> {code}
>  
>  
> *Scala side:*
>  
> {code:java}
> scala> import org.apache.spark.ml.PipelineModel
> scala> val pp = PipelineModel.load("./npm_test")
> java.lang.IllegalArgumentException: requirement failed: Error loading 
> metadata: Expected class name org.apache.spark.ml.PipelineModel but found 
> class name pyspark.ml.pipeline.PipelineModel
>  at scala.Predef$.require(Predef.scala:224)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
>  at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
>  ... 50 elided
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python

2019-09-10 Thread Junichi Koizumi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927076#comment-16927076
 ] 

Junichi  Koizumi  commented on SPARK-28902:
---

Since versions aren't the main concern here, should I create a PR?

> Spark ML Pipeline with nested Pipelines fails to load when saved from Python
> 
>
> Key: SPARK-28902
> URL: https://issues.apache.org/jira/browse/SPARK-28902
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Saif Addin
>Priority: Minor
>
> Hi, this error is affecting a bunch of our nested use cases.
> A *PipelineModel* that has another *PipelineModel* as one of its stages 
> fails to load from Scala when it was saved from Python.
> *Python side:*
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import Tokenizer
> t = Tokenizer()
> p = Pipeline().setStages([t])
> d = spark.createDataFrame([["Hello Peter Parker"]])
> pm = p.fit(d)
> np = Pipeline().setStages([pm])
> npm = np.fit(d)
> npm.write().save('./npm_test')
> {code}
>  
>  
> *Scala side:*
>  
> {code:java}
> scala> import org.apache.spark.ml.PipelineModel
> scala> val pp = PipelineModel.load("./npm_test")
> java.lang.IllegalArgumentException: requirement failed: Error loading 
> metadata: Expected class name org.apache.spark.ml.PipelineModel but found 
> class name pyspark.ml.pipeline.PipelineModel
>  at scala.Predef$.require(Predef.scala:224)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
>  at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
>  ... 50 elided
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python

2019-09-10 Thread Saif Addin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927077#comment-16927077
 ] 

Saif Addin commented on SPARK-28902:


Ah, here I thought you said you couldn't reproduce it. Gladly hoping to see 
this fixed :)

> Spark ML Pipeline with nested Pipelines fails to load when saved from Python
> 
>
> Key: SPARK-28902
> URL: https://issues.apache.org/jira/browse/SPARK-28902
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Saif Addin
>Priority: Minor
>
> Hi, this error is affecting a bunch of our nested use cases.
> A *PipelineModel* that has another *PipelineModel* as one of its stages 
> fails to load from Scala when it was saved from Python.
> *Python side:*
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import Tokenizer
> t = Tokenizer()
> p = Pipeline().setStages([t])
> d = spark.createDataFrame([["Hello Peter Parker"]])
> pm = p.fit(d)
> np = Pipeline().setStages([pm])
> npm = np.fit(d)
> npm.write().save('./npm_test')
> {code}
>  
>  
> *Scala side:*
>  
> {code:java}
> scala> import org.apache.spark.ml.PipelineModel
> scala> val pp = PipelineModel.load("./npm_test")
> java.lang.IllegalArgumentException: requirement failed: Error loading 
> metadata: Expected class name org.apache.spark.ml.PipelineModel but found 
> class name pyspark.ml.pipeline.PipelineModel
>  at scala.Predef$.require(Predef.scala:224)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
>  at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
>  ... 50 elided
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29041) Allow createDataFrame to accept bytes as binary type

2019-09-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-29041:


 Summary: Allow createDataFrame to accept bytes as binary type
 Key: SPARK-29041
 URL: https://issues.apache.org/jira/browse/SPARK-29041
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4, 3.0.0
Reporter: Hyukjin Kwon


```
spark.createDataFrame([[b"abcd"]], "col binary")
```

simply fails. bytes should also be accepted as binary type



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29041) Allow createDataFrame to accept bytes as binary type

2019-09-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29041:
-
Description: 
{code}
spark.createDataFrame([[b"abcd"]], "col binary")
{code}

simply fails. bytes should also be accepted as binary type

  was:
```
spark.createDataFrame([[b"abcd"]], "col binary")
```

simply fails. bytes should also be accepted as binary type


> Allow createDataFrame to accept bytes as binary type
> 
>
> Key: SPARK-29041
> URL: https://issues.apache.org/jira/browse/SPARK-29041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> spark.createDataFrame([[b"abcd"]], "col binary")
> {code}
> simply fails. bytes should also be accepted as binary type



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29041) Allow createDataFrame to accept bytes as binary type

2019-09-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29041:
-
Description: 
{code}
spark.createDataFrame([[b"abcd"]], "col binary")
{code}

simply fails as below:

{code}

{code}

bytes should also be accepted as binary type

  was:
{code}
spark.createDataFrame([[b"abcd"]], "col binary")
{code}

simply fails. bytes should also be accepted as binary type


> Allow createDataFrame to accept bytes as binary type
> 
>
> Key: SPARK-29041
> URL: https://issues.apache.org/jira/browse/SPARK-29041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> spark.createDataFrame([[b"abcd"]], "col binary")
> {code}
> simply fails as below:
> {code}
> {code}
> bytes should also be accepted as binary type



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29041) Allow createDataFrame to accept bytes as binary type

2019-09-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29041:
-
Description: 
{code}
spark.createDataFrame([[b"abcd"]], "col binary")
{code}

simply fails as below:

in Python 3

{code}
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
data = list(data)
  File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
verify_func(obj)
  File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
verifier(v)
  File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
verify_acceptable_types(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
verify_acceptable_types
% (dataType, obj, type(obj
TypeError: field col: BinaryType can not accept object b'abcd' in type 
{code}

in Python 2:

{code}
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
data = list(data)
  File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
verify_func(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
verifier(v)
  File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
verify_acceptable_types(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
verify_acceptable_types
% (dataType, obj, type(obj
TypeError: field col: BinaryType can not accept object 'abcd' in type 
{code}

{{bytes}} should also be accepted as binary type

  was:
{code}
spark.createDataFrame([[b"abcd"]], "col binary")
{code}

simply fails as below:

{code}

{code}

bytes should also be accepted as binary type


> Allow createDataFrame to accept bytes as binary type
> 
>
> Key: SPARK-29041
> URL: https://issues.apache.org/jira/browse/SPARK-29041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> spark.createDataFrame([[b"abcd"]], "col binary")
> {code}
> simply fails as below:
> in Python 3
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1282, in 
> verify_acceptable_types
> % (dataType, obj, type(obj
> TypeError: field col: BinaryType can not accept object b'abcd' in type  'bytes'>
> {code}
> in Python 2:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/sql/session.py", line 787, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/.../spark/python/pyspark/sql/session.py", line 442, in 
> _createFromLocal
> data = list(data)
>   File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
> verify_func(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
> verifier(v)
>   File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
> verify_value(obj)
>   File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
> verify_acceptable_types(obj)
>  

[jira] [Updated] (SPARK-29001) Print better log when process of events becomes slow

2019-09-10 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang updated SPARK-29001:
-
Summary: Print better log when process of events becomes slow  (was: Print 
event thread stack trace when EventQueue starts to drop events)

> Print better log when process of events becomes slow
> 
>
> Key: SPARK-29001
> URL: https://issues.apache.org/jira/browse/SPARK-29001
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Minor
>
> We shall print the event thread's stack trace when the EventQueue starts to 
> drop events; this helps us find out what type of event is slow.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29001) Print better log when process of events becomes slow

2019-09-10 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang updated SPARK-29001:
-
Description: We shall print a better log when the processing of events becomes 
slow, to help find out what type of event is slow.  (was: We shall print the 
event thread's stack trace when the EventQueue starts to drop events; this 
helps us find out what type of event is slow.)

> Print better log when process of events becomes slow
> 
>
> Key: SPARK-29001
> URL: https://issues.apache.org/jira/browse/SPARK-29001
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Minor
>
> We shall print a better log when the processing of events becomes slow, to 
> help find out what type of event is slow.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29026) Improve error message when constructor in `ScalaReflection` isn't found

2019-09-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29026:


Assignee: Mick Jermsurawong

> Improve error message when constructor in `ScalaReflection` isn't found  
> -
>
> Key: SPARK-29026
> URL: https://issues.apache.org/jira/browse/SPARK-29026
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Mick Jermsurawong
>Assignee: Mick Jermsurawong
>Priority: Minor
>
> Currently, `constructParams` in `ScalaReflection`, the method that gets the 
> constructor parameters of a given type, will throw an exception if the type 
> has no constructor:
> {code:java}
>  is not a term
> scala.ScalaReflectionException:  {code}
> In normal usage of ExpressionEncoder, this can happen if the type is an 
> interface extending `scala.Product`.
> Also, since this is a protected method, the same could happen for other 
> arbitrary types without a constructor.
> To reproduce the error, the following will fail when trying to get 
> {{Encoder[NoConstructorProductTrait]}} 
> {code:java}
> trait NoConstructorProductTrait extends scala.Product {} {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29026) Improve error message when constructor in `ScalaReflection` isn't found

2019-09-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29026.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25736
[https://github.com/apache/spark/pull/25736]

> Improve error message when constructor in `ScalaReflection` isn't found  
> -
>
> Key: SPARK-29026
> URL: https://issues.apache.org/jira/browse/SPARK-29026
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Mick Jermsurawong
>Assignee: Mick Jermsurawong
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, `constructParams` in `ScalaReflection`, the method that gets the 
> constructor parameters of a given type, will throw an exception if the type 
> has no constructor:
> {code:java}
>  is not a term
> scala.ScalaReflectionException:  {code}
> In normal usage of ExpressionEncoder, this can happen if the type is an 
> interface extending `scala.Product`.
> Also, since this is a protected method, the same could happen for other 
> arbitrary types without a constructor.
> To reproduce the error, the following will fail when trying to get 
> {{Encoder[NoConstructorProductTrait]}} 
> {code:java}
> trait NoConstructorProductTrait extends scala.Product {} {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28570) Shuffle Storage API: Use writer API in UnsafeShuffleWriter

2019-09-10 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28570.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25304
[https://github.com/apache/spark/pull/25304]

> Shuffle Storage API: Use writer API in UnsafeShuffleWriter
> --
>
> Key: SPARK-28570
> URL: https://issues.apache.org/jira/browse/SPARK-28570
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> Use the APIs introduced in SPARK-28209 in the UnsafeShuffleWriter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28570) Shuffle Storage API: Use writer API in UnsafeShuffleWriter

2019-09-10 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28570:
--

Assignee: Matt Cheah

> Shuffle Storage API: Use writer API in UnsafeShuffleWriter
> --
>
> Key: SPARK-28570
> URL: https://issues.apache.org/jira/browse/SPARK-28570
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
>
> Use the APIs introduced in SPARK-28209 in the UnsafeShuffleWriter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25157) Streaming of image files from directory

2019-09-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25157.
--
Resolution: Duplicate

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Major
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> them using Structured Streaming and Dataset. We are using Kafka to stream 
> images and then doing processing at spark. We need a method in Spark to 
> stream images from directory. Currently *{{DataStreamReader}}* doesn't 
> support Image files. With the introduction of 
> *org.apache.spark.ml.image.ImageSchema* class, we think streaming 
> capabilities can be added for image files. It is fine if it won't support 
> some of the structured streaming features as it is a binary file. This method 
> could be similar to *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-10 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29042:
---

 Summary: Sampling-based RDD with unordered input should be 
INDETERMINATE
 Key: SPARK-29042
 URL: https://issues.apache.org/jira/browse/SPARK-29042
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We have found and fixed the correctness issue when RDD output is INDETERMINATE. 
One missing part is sampling-based RDDs. This kind of RDD is order-sensitive to 
its input. A sampling-based RDD with unordered input should be INDETERMINATE.
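A hedged sketch (not from the ticket) of why sampling is order-sensitive; `sc` is assumed to be an existing SparkContext, and the data set and seed are made up:

{code:scala}
// The sampler consumes its input sequentially, so which elements it keeps
// depends on the order of the partition it reads. After a shuffle without a
// defined ordering (e.g. groupByKey), a recomputed partition may arrive in a
// different order, and the same seed can then pick different rows.
val shuffled = sc.parallelize(1 to 100000, 10)
  .map(x => (x % 7, x))
  .groupByKey()        // shuffle output has no guaranteed intra-partition order
  .flatMap(_._2)

val sampled = shuffled.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// If a partition of `shuffled` is recomputed after a failure, `sampled` can end
// up with different elements than the first attempt, which is why such an RDD
// should be marked INDETERMINATE.
println(sampled.count())
{code}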



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)
feiwang created SPARK-29043:
---

 Summary: [History Server]Only one replay thread of 
FsHistoryProvider work because of straggler
 Key: SPARK-29043
 URL: https://issues.apache.org/jira/browse/SPARK-29043
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: feiwang






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29038:
--
Comment: was deleted

(was: I am building a similar framework. It caches the sub-query data of a SQL 
query when certain conditions are satisfied, and when a new query comes in it 
checks the LogicalPlan; if it contains the same sub-plan, it rewrites the 
LogicalPlan to use the cached data.

 Now it supports caching data in memory and in Alluxio.)

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927189#comment-16927189
 ] 

Jungtaek Lim commented on SPARK-29027:
--

[~koert]

Please try moving krb5.conf out of the way and run the test again. If it works, 
check whether "EXAMPLE.COM" is defined as a realm in krb5.conf.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> I am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.178 
> s]
> [INFO] Spark Project Tags . SUCCESS [  9.373 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 24.586 
> s]
> [INFO] Spark Project Local DB ...

[jira] [Comment Edited] (SPARK-29027) KafkaDelegationTokenSuite fails

2019-09-10 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927189#comment-16927189
 ] 

Jungtaek Lim edited comment on SPARK-29027 at 9/11/19 1:59 AM:
---

[~koert]

Please try moving krb5.conf out of the way and run the test again. If it works, 
check whether "EXAMPLE.COM" is defined as a realm in krb5.conf, as MiniKdc seems 
to use it as its default configuration.


was (Author: kabhwan):
[~koert]

Please try moving krb5.conf out of the way and run the test again. If it works, 
check whether "EXAMPLE.COM" is defined as a realm in krb5.conf.

> KafkaDelegationTokenSuite fails
> ---
>
> Key: SPARK-29027
> URL: https://issues.apache.org/jira/browse/SPARK-29027
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code}
> commit 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4
> Author: Sean Owen 
> Date:   Mon Sep 9 10:19:40 2019 -0500
> {code}
> Ubuntu 16.04 with OpenJDK 1.8 (1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
>Reporter: koert kuipers
>Priority: Minor
>
> I am seeing consistent failure of KafkaDelegationTokenSuite on master
> {code}
> JsonUtilsSuite:
> - parsing partitions
> - parsing partitionOffsets
> KafkaDelegationTokenSuite:
> javax.security.sasl.SaslException: Failure to initialize security context 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:125)
>   at 
> com.sun.security.sasl.gsskerb.FactoryImpl.createSaslServer(FactoryImpl.java:85)
>   at javax.security.sasl.Sasl.createSaslServer(Sasl.java:524)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:118)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer$1.run(ZooKeeperSaslServer.java:114)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:114)
>   at 
> org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:48)
>   at 
> org.apache.zookeeper.server.NIOServerCnxn.(NIOServerCnxn.java:100)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.createConnection(NIOServerCnxnFactory.java:156)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:197)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos credentails)
>   at 
> sun.security.jgss.krb5.Krb5AcceptCredential.getInstance(Krb5AcceptCredential.java:87)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:127)
>   at 
> sun.security.jgss.GSSManagerImpl.getCredentialElement(GSSManagerImpl.java:193)
>   at sun.security.jgss.GSSCredentialImpl.add(GSSCredentialImpl.java:427)
>   at sun.security.jgss.GSSCredentialImpl.(GSSCredentialImpl.java:62)
>   at 
> sun.security.jgss.GSSManagerImpl.createCredential(GSSManagerImpl.java:154)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Server.(GssKrb5Server.java:108)
>   ... 12 more
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED ***
>   org.I0Itec.zkclient.exception.ZkAuthFailedException: Authentication failure
>   at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:947)
>   at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:924)
>   at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1231)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:157)
>   at org.I0Itec.zkclient.ZkClient.(ZkClient.java:131)
>   at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:93)
>   at kafka.utils.ZkUtils$.apply(ZkUtils.scala:75)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedZookeeper(KafkaTestUtils.scala:202)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:243)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   ...
> KafkaSourceOffsetSuite:
> - comparison {"t":{"0":1}} <=> {"t":{"0":2}}
> - comparison {"t":{"1":0,"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1},"T":{"0":0}} <=> {"t":{"0":2},"T":{"0":1}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":1,"0":2}}
> - comparison {"t":{"0":1}} <=> {"t":{"1":3,"0":2}}
> - basic serialization - deserialization
> - OffsetSeqLog serialization - deserialization
> - read Spark 2.1.0 offset format
> {code}
> {code}
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent

[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29043:

Attachment: screenshot-1.png

> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29043:

Description: 
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of 

> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29043:

Description: 
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of straggler.



  was:
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of 


> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of straggler.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29044) Resolved attribute(s) R#661751,residue#661752 missing from ipi#660814,residue#660731,exper_set#660827,R#660730,description#660815,sequence#660817,exper#660828,symbol#660

2019-09-10 Thread Kristine Senkane (Jira)
Kristine Senkane created SPARK-29044:


 Summary: Resolved attribute(s) R#661751,residue#661752 missing 
from 
ipi#660814,residue#660731,exper_set#660827,R#660730,description#660815,sequence#660817,exper#660828,symbol#660816
 Key: SPARK-29044
 URL: https://issues.apache.org/jira/browse/SPARK-29044
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.4.3
Reporter: Kristine Senkane


{code:java}
SELECT group_averages.*
FROM group_averages
NATURAL INNER JOIN (
    SELECT MAX(R) AS max_R, ipi AS ipi, description AS description, symbol AS symbol, residue
    FROM group_averages
    GROUP BY ipi, description, symbol, residue
) AS all_rows_bigger_than_four
WHERE all_rows_bigger_than_four.max_R >= 4.0
{code}
causes,
{code:java}
---
Py4JJavaError Traceback (most recent call last)
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:

/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:

Py4JJavaError: An error occurred while calling o21.sql.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) 
R#661751,residue#661752 missing from 
ipi#660814,residue#660731,exper_set#660827,R#660730,description#660815,sequence#660817,exper#660828,symbol#660816
 in operator !Project [ipi#660814, symbol#660816, description#660815, 
sequence#660817, R#661751, exper#660828, exper_set#660827, residue#661752]. 
Attribute(s) with the same name appear in the operation: R,residue. Please 
check if the right attribute(s) are used.;;
Project [ipi#660546, description#660547, symbol#660548, residue#660731, 
group_description#660716, total_residues_detected#660809L, 
num_datasets#660810L, R#660811]
+- Filter (max_R#661746 >= cast(4.0 as double))
   +- Project [ipi#660546, description#660547, symbol#660548, residue#660731, 
group_description#660716, total_residues_detected#660809L, 
num_datasets#660810L, R#660811, max_R#661746]
  +- Join Inner, ipi#660546 = ipi#661747) && (description#660547 = 
description#661748)) && (symbol#660548 = symbol#661749)) && (residue#660731 = 
residue#661752))
 :- SubqueryAlias `group_averages`
 :  +- Filter (num_datasets#660810L > cast(1 as bigint))
 : +- Aggregate [ipi#660546, description#660547, symbol#660548, 
residue#660731, exper_set#660559, group_description#660716, 
total_residues_detected#660809L], [ipi#660546, description#660547, 
symbol#660548, residue#660731, group_description#660716, 
total_residues_detected#660809L, count(R#660758) AS num_datasets#660810L, CASE 
WHEN (stddev_samp(R#660758) < (cast(0.6 as double) * avg(R#660758))) THEN 
avg(R#660758) ELSE CASE WHEN (min(R#660758) < cast(4 as double)) THEN 
min(R#660758) ELSE avg(R#660758) END END AS R#660811]
 :+- Project [ipi#660546, description#660547, symbol#660548, 
exper_set#660559, exper#660560, residue#660731, group_description#660716, 
R#660758, total_residues_detected#660809L]
 :   +- Join Inner, (((ipi#660546 = ipi#660814) && 
(description#660547 = description#660815)) && (symbol#660548 = symbol#660816))
 :  :- SubqueryAlias `table_by_residue`
 :  :  +- Aggregate [ipi#660546, description#660547, 
symbol#660548, residue#660731, exper_set#660559, exper#660560, 
group_description#660716], [ipi#660546, description#660547, symbol#660548, 
exper_set#660559, exper#660560, residue#660731, group_description#660716, CASE 
WHEN (stddev_samp(R#660730) < (cast(0.6 as double) * avg(R#660730))) THEN 
avg(R#660730) ELSE CASE WHEN (min(R#660730) < cast(4 as double)) THEN 
min(R#660730) ELSE avg(R#660730) END END AS R#660758]
 :  : +- Join Inner, (exper#660560 = Cimage link#660715)
 :  ::- SubqueryAlias `table_by_peptide`
 :  ::  +- Project [ipi#660546, symbol#660548, 
description#660547, sequence#660549, R#660730, exper#660560, exper_set#660559, 
residue#660731]
 :  :: +- Sort [ipi#660546 ASC NULLS FIRST], 
true
 :  ::+- Aggregate [exper#660560, 
ipi#660546, ((instr(protein_sequence#660699, regexp_replace(sequence#660549, 
[.*-], )) + instr(sequence#660549, *)) - 3), symbol#660548, exper_set#660559, 
sp#660696, sequence#660549, charge#660551, description#660547], [ipi#660546, 
symbol#660548, description#660547, sequence#660549, avg(cast(IR#660553 as 
double)) AS R#660730, exper#660560, ex

[jira] [Created] (SPARK-29045) Test failed due to table already exists in SQLMetricsSuite

2019-09-10 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-29045:
--

 Summary: Test failed due to table already exists in SQLMetricsSuite
 Key: SPARK-29045
 URL: https://issues.apache.org/jira/browse/SPARK-29045
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Lantao Jin


In method [[SQLMetricsTestUtils.testMetricsDynamicPartition()]], there is a 
CREATE TABLE sentence without [[withTable]] block. It causes test failure if 
use the same table name in other unit tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29045) Test failed due to table already exists in SQLMetricsSuite

2019-09-10 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-29045:
---
Description: In method 
{{SQLMetricsTestUtils.testMetricsDynamicPartition()}}, there is a CREATE TABLE 
statement without a {{withTable}} block. It causes test failures if the same 
table name is used in other unit tests.  (was: In method 
[[SQLMetricsTestUtils.testMetricsDynamicPartition()]], there is a CREATE TABLE 
sentence without [[withTable]] block. It causes test failure if use the same 
table name in other unit tests.)

> Test failed due to table already exists in SQLMetricsSuite
> --
>
> Key: SPARK-29045
> URL: https://issues.apache.org/jira/browse/SPARK-29045
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In method {{SQLMetricsTestUtils.testMetricsDynamicPartition()}}, there is a 
> CREATE TABLE statement without a {{withTable}} block. It causes test failures 
> if the same table name is used in other unit tests.
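A hedged sketch of what wrapping the statement in {{withTable}} amounts to; `spark` is assumed to be an active SparkSession, the table name and schema are made up, and the real helper lives in Spark's SQL test utilities:

{code:scala}
// Drop the table afterwards, even if the assertions fail, so that other suites
// can safely reuse the name. `metrics_dyn_part` is a hypothetical table name.
val table = "metrics_dyn_part"
try {
  spark.sql(
    s"""CREATE TABLE $table (value STRING, key INT)
       |USING parquet
       |PARTITIONED BY (key)""".stripMargin)
  // ... insert with dynamic partitions and assert on the SQL metrics ...
} finally {
  spark.sql(s"DROP TABLE IF EXISTS $table")
}
{code}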



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29043:

Description: 
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of straggler.

Let's check the code.
https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547

  was:
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of straggler.

Let's check the code.
https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-LL547


> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29043:

Description: 
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of straggler.

Let's check the code.
https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-LL547

  was:
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of straggler.




> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-LL547



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29043:

Description: 
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of straggler.

Let's check the code.
https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547

There is a synchronous operation for all replay tasks.

  was:
As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
spark history server.
However, there is only one replay thread work because of straggler.

Let's check the code.
https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547


> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation for all replay tasks.
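A hedged sketch of the reported pattern, not the actual FsHistoryProvider code: replay tasks are submitted to a pool, but the caller then blocks until the whole batch finishes, so one slow event log can leave the remaining threads idle.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// 30 replay threads, matching spark.history.fs.numReplayThreads=30 above.
val pool = Executors.newFixedThreadPool(30)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

def replay(log: String): Unit = { /* parse one event log */ }

def checkForLogs(logs: Seq[String]): Unit = {
  val tasks = logs.map(log => Future(replay(log)))
  // Waiting for the entire batch means a single huge or slow log delays the
  // next scan, even though the other 29 threads have already gone idle.
  tasks.foreach(t => Await.result(t, Duration.Inf))
}
{code}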



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927198#comment-16927198
 ] 

feiwang commented on SPARK-29043:
-

I think we can change it  to Asynchronous.

> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation for all replay tasks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread feiwang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927198#comment-16927198
 ] 

feiwang edited comment on SPARK-29043 at 9/11/19 2:26 AM:
--

I think it is better to replay logs asynchronously.


was (Author: hzfeiwang):
I think we can change it  to Asynchronous.

> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation for all replay tasks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-09-10 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927205#comment-16927205
 ] 

Jungtaek Lim commented on SPARK-29043:
--

Replaying logs is asynchronous: only the wait for the replays to finish is 
synchronous, and that shouldn't matter much. So you may want to post a full 
thread dump showing what the other threads are doing while one thread is busy 
replaying logs. If it really isn't running concurrently, some other place might 
be holding a lock.

> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> spark history server.
> However, there is only one replay thread work because of straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation for all replay tasks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927214#comment-16927214
 ] 

Lantao Jin commented on SPARK-29038:


[~mgaido] IIUC, there is no "query caching" in Spark, even no result cache. But 
Spark natively supports RDD-level cache. Multiple jobs can share cached RDD. 
The cached RDD is closer to the calculation result and requires less 
computation. In addition, the file system level cache such as HDFS cache or 
Alluxio can also load data into memory in advance, improving data processing 
efficiency. But materialized view actually is a technology about summaries 
*precalculating*. Summaries are special types of aggregate views that improve 
query execution times by precalculating expensive joins and aggregation 
operations prior to execution and storing the results in a table in the 
database. The query optimizer transparently rewrites the request to use the 
materialized view. Queries go directly to the materialized view and not to the 
underlying detail tables which had been materialized to storage like HDFS. 
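As a hedged illustration of the precalculation idea (assuming an active SparkSession `spark`; the tables and columns below are made up, and the DDL is plain Spark SQL rather than the proposed materialized-view syntax): a user can precompute a summary by hand today, but has to rewrite queries against it manually, which is the rewrite the SPIP wants the optimizer to do transparently.

{code:scala}
// Manually "materialize" an expensive aggregate into its own table.
// `sales` and `sales_daily_summary` are hypothetical names.
spark.sql(
  """CREATE TABLE sales_daily_summary USING parquet AS
    |SELECT sale_date, store_id, SUM(amount) AS total_amount
    |FROM sales
    |GROUP BY sale_date, store_id""".stripMargin)

// A query that only needs the summary can read the small precomputed table
// instead of rescanning and re-aggregating the detail table. With the SPIP,
// the optimizer would perform this rewrite automatically.
spark.sql(
  """SELECT sale_date, SUM(total_amount) AS total_amount
    |FROM sales_daily_summary
    |GROUP BY sale_date""".stripMargin).show()
{code}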

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927214#comment-16927214
 ] 

Lantao Jin edited comment on SPARK-29038 at 9/11/19 3:24 AM:
-

[~mgaido] IIUC, there is no "query caching" in Spark, even no result cache. But 
Spark natively supports RDD-level cache. Multiple jobs can share cached RDD. 
The cached RDD is closer to the calculation result and requires less 
computation. In addition, the file system level cache such as HDFS cache or 
Alluxio can also load data into memory in advance, improving data processing 
efficiency. But materialized view actually is a technology about summaries 
*precalculating*. Summaries are special types of aggregate views that improve 
query execution times by precalculating expensive joins and aggregation 
operations prior to execution and storing the results in a table in the 
database. The query optimizer transparently rewrites the request to use the 
materialized view. Queries go directly to the materialized view  
 which had been persisted in storage (e.g HDFS) and not to the underlying 
detail tables. 


was (Author: cltlfcjin):
[~mgaido] IIUC, there is no "query caching" in Spark, even no result cache. But 
Spark natively supports RDD-level cache. Multiple jobs can share cached RDD. 
The cached RDD is closer to the calculation result and requires less 
computation. In addition, the file system level cache such as HDFS cache or 
Alluxio can also load data into memory in advance, improving data processing 
efficiency. But materialized view actually is a technology about summaries 
*precalculating*. Summaries are special types of aggregate views that improve 
query execution times by precalculating expensive joins and aggregation 
operations prior to execution and storing the results in a table in the 
database. The query optimizer transparently rewrites the request to use the 
materialized view. Queries go directly to the materialized view and not to the 
underlying detail tables which had been materialized to storage like HDFS. 

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927217#comment-16927217
 ] 

angerszhu commented on SPARK-29038:
---

[~cltlfcjin]

*Precalculating, a little like CarbonData's DataMap.*

*Have you implemented the whole matching logic?*

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927229#comment-16927229
 ] 

Lantao Jin commented on SPARK-29038:


[~angerszhuuu] By default, we use Parquet to store the data of a materialized 
view, but it supports all storage formats Spark supports. We have implemented 
most of the matching logic for filter, join and aggregate. But it cannot cover 
all scenarios, like JoinBack, since Spark currently doesn't support PKs or 
dimensions like other DBMSs (e.g. Oracle).

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927243#comment-16927243
 ] 

angerszhu commented on SPARK-29038:
---

I am interested in the matching case where you create an MV table q1_mv grouped 
by `l_returnflag, l_linestatus, l_shipdate`, while your query groups by 
`l_returnflag, l_linestatus`.

This may be the most complex part to get right. I wanted to do this in my cache 
framework, but I couldn't find a good way to do it.

Can I contact you via WeChat?
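A hedged sketch of that matching case, assuming an active SparkSession `spark` (the table `q1_mv` and its column `sum_qty` are assumptions): because SUM re-aggregates cleanly, a query grouped by the coarser key set can be answered by rolling up the finer-grained materialized result.

{code:scala}
// The materialized result is assumed to be grouped by
// (l_returnflag, l_linestatus, l_shipdate) and to expose SUM(l_quantity) AS sum_qty.
// A query grouped only by (l_returnflag, l_linestatus) can then be rewritten to
// re-aggregate those rows instead of touching the detail table.
spark.sql(
  """SELECT l_returnflag, l_linestatus, SUM(sum_qty) AS sum_qty
    |FROM q1_mv
    |GROUP BY l_returnflag, l_linestatus""".stripMargin).show()
{code}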

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927256#comment-16927256
 ] 

Lantao Jin commented on SPARK-29038:


[~angerszhuuu] Of course, I will contact you offline.

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Dilip Biswal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258
 ] 

Dilip Biswal commented on SPARK-29038:
--

[~cltlfcjin] 

Actually i had similar question as [~mgaido]. We have been writing the SQL 
reference for 3.0 have recently

documented {code} CACHE TABLE {code}  in 
[https://github.com/apache/spark/pull/25532].  So in SPARK, it is
possible to cache the result of a complex query involving joins, aggregates 
etc. 
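For reference, a hedged sketch of that (assuming an active SparkSession `spark`; the tables and columns below are made up, and `CACHE TABLE ... AS SELECT` is the statement documented in the PR above):

{code:scala}
// Cache the result of a query with a join and an aggregate under a name that
// later queries can reference. `orders`, `stores` and `cached_region_totals`
// are hypothetical names.
spark.sql(
  """CACHE TABLE cached_region_totals AS
    |SELECT s.region, SUM(o.amount) AS total_amount
    |FROM orders o JOIN stores s ON o.store_id = s.store_id
    |GROUP BY s.region""".stripMargin)

spark.sql("SELECT * FROM cached_region_totals WHERE total_amount > 1000").show()
{code}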

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Dilip Biswal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258
 ] 

Dilip Biswal edited comment on SPARK-29038 at 9/11/19 5:09 AM:
---

[~cltlfcjin]

Actually i had similar question as [~mgaido]. We have been writing the SQL 
reference for 3.0 and have recently documented
{code:java}
 CACHE TABLE {code}
in [https://github.com/apache/spark/pull/25532].  So in SPARK, it is
 possible to cache the result of a complex query involving joins, aggregates 
etc. 


was (Author: dkbiswal):
[~cltlfcjin] 

Actually i had similar question as [~mgaido]. We have been writing the SQL 
reference for 3.0 and have recently documented {code} CACHE TABLE {code}  in 
[https://github.com/apache/spark/pull/25532].  So in SPARK, it is
possible to cache the result of a complex query involving joins, aggregates 
etc. 

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Dilip Biswal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258
 ] 

Dilip Biswal edited comment on SPARK-29038 at 9/11/19 5:09 AM:
---

[~cltlfcjin] 

Actually i had similar question as [~mgaido]. We have been writing the SQL 
reference for 3.0 and have recently documented {code} CACHE TABLE {code}  in 
[https://github.com/apache/spark/pull/25532].  So in SPARK, it is
possible to cache the result of a complex query involving joins, aggregates 
etc. 


was (Author: dkbiswal):
[~cltlfcjin] 

Actually i had similar question as [~mgaido]. We have been writing the SQL 
reference for 3.0 have recently

documented {code} CACHE TABLE {code}  in 
[https://github.com/apache/spark/pull/25532].  So in SPARK, it is
possible to cache the result of a complex query involving joins, aggregates 
etc. 

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Dilip Biswal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927258#comment-16927258
 ] 

Dilip Biswal edited comment on SPARK-29038 at 9/11/19 5:13 AM:
---

[~cltlfcjin]

Actually i had similar question as [~mgaido]. We have been writing the SQL 
reference for 3.0 and have recently documented
{code:java}
 CACHE TABLE {code}
in [https://github.com/apache/spark/pull/25532].  So in SPARK, it is
 possible to cache the result of a complex query involving joins, aggregates 
etc, right ?


was (Author: dkbiswal):
[~cltlfcjin]

Actually i had similar question as [~mgaido]. We have been writing the SQL 
reference for 3.0 and have recently documented
{code:java}
 CACHE TABLE {code}
in [https://github.com/apache/spark/pull/25532].  So in SPARK, it is
 possible to cache the result of a complex query involving joins, aggregates 
etc. 

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


