[jira] [Assigned] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26551:


Assignee: (was: Apache Spark)

> Selecting one complex field and having is null predicate on another complex 
> field can cause error
> -
>
> Key: SPARK-26551
> URL: https://issues.apache.org/jira/browse/SPARK-26551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> The query below can cause error when doing schema pruning:
> {code:java}
> val query = sql("select * from contacts")
>   .where("name.middle is not null")
>   .select(
> "id",
> "name.first",
> "name.middle",
> "name.last"
>   )
>   .where("last = 'Jones'")
>   .select(count("id"))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26551:


Assignee: Apache Spark

> Selecting one complex field and having is null predicate on another complex 
> field can cause error
> -
>
> Key: SPARK-26551
> URL: https://issues.apache.org/jira/browse/SPARK-26551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> The query below can cause error when doing schema pruning:
> {code:java}
> val query = sql("select * from contacts")
>   .where("name.middle is not null")
>   .select(
> "id",
> "name.first",
> "name.middle",
> "name.last"
>   )
>   .where("last = 'Jones'")
>   .select(count("id"))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error

2019-01-05 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-26551:
---

 Summary: Selecting one complex field and having is null predicate 
on another complex field can cause error
 Key: SPARK-26551
 URL: https://issues.apache.org/jira/browse/SPARK-26551
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


The query below can cause error when doing schema pruning:

{code:java}
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
"id",
"name.first",
"name.middle",
"name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
{code}
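
For reference, a minimal sketch of the nested data shape the query runs against. The case classes, the path, and the temp-view setup below are hypothetical (they are not part of this report); the config flag is the nested schema pruning option introduced in Spark 2.4.

{code:java}
import org.apache.spark.sql.functions.count
import spark.implicits._

// Hypothetical nested schema matching the fields referenced in the query above.
case class Name(first: String, middle: String, last: String)
case class Contact(id: Int, name: Name)

val path = "/tmp/contacts"  // illustrative location only
Seq(
  Contact(1, Name("John", "A.", "Jones")),
  Contact(2, Name("Jane", null, "Doe"))
).toDF().write.mode("overwrite").parquet(path)

spark.read.parquet(path).createOrReplaceTempView("contacts")

// Schema pruning only kicks in when this flag is enabled.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
{code}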




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26548) Don't block during query optimization

2019-01-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26548.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Don't block during query optimization
> -
>
> Key: SPARK-26548
> URL: https://issues.apache.org/jira/browse/SPARK-26548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Assignee: Dave DeCaprio
>Priority: Minor
>  Labels: sql
> Fix For: 3.0.0
>
>
> In Spark 2.4.0 the CacheManager was updated so that it does not execute jobs 
> while it holds a lock. This was introduced in SPARK-23880.
> The CacheManager still holds a write lock during the execution of the query 
> optimizer. For complex queries the optimizer can run for a long time (we see 
> 10-15 minutes for some exceptionally large queries), so only one thread can 
> optimize at a time.
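
As a rough illustration of the pattern being described (hypothetical names, not the actual CacheManager code), the expensive optimization step runs while the write lock is held, so every other caching thread waits behind it:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock

// Sketch only: a cache that optimizes plans inside its write-locked section.
object CacheManagerSketch {
  private val cacheLock = new ReentrantReadWriteLock()
  private var cachedPlans: List[String] = Nil

  def cacheQuery(plan: String, optimize: String => String): Unit = {
    cacheLock.writeLock().lock()
    try {
      // Optimization can take minutes for very large plans, and it happens while
      // the write lock is held, so only one thread can get here at a time.
      cachedPlans = optimize(plan) :: cachedPlans
    } finally {
      cacheLock.writeLock().unlock()
    }
  }
}
{code}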



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26548) Don't block during query optimization

2019-01-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-26548:
---

Assignee: Dave DeCaprio

> Don't block during query optimization
> -
>
> Key: SPARK-26548
> URL: https://issues.apache.org/jira/browse/SPARK-26548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Assignee: Dave DeCaprio
>Priority: Minor
>  Labels: sql
>
> In Spark 2.4.0 the CacheManager was updated so that it does not execute jobs 
> while it holds a lock. This was introduced in SPARK-23880.
> The CacheManager still holds a write lock during the execution of the query 
> optimizer. For complex queries the optimizer can run for a long time (we see 
> 10-15 minutes for some exceptionally large queries), so only one thread can 
> optimize at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735060#comment-16735060
 ] 

Dongjoon Hyun commented on SPARK-26537:
---

This is resolved via 
- https://github.com/apache/spark/pull/23454
- https://github.com/apache/spark/pull/23472
- https://github.com/apache/spark/pull/23473

> update the release scripts to point to gitbox
> -
>
> Key: SPARK-26537
> URL: https://issues.apache.org/jira/browse/SPARK-26537
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
>
> we're seeing packaging build failures like this:  
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old 
> apache git repos:
>  
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git";
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git";
> pom.xml: 
> scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all 
> released versions.
> i'll put together a pull request later today.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26537) update the release scripts to point to gitbox

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26537.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1
   2.3.3
   2.2.3

> update the release scripts to point to gitbox
> -
>
> Key: SPARK-26537
> URL: https://issues.apache.org/jira/browse/SPARK-26537
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
>
> we're seeing packaging build failures like this:  
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old 
> apache git repos:
>  
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git";
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git";
> pom.xml: 
> scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all 
> released versions.
> i'll put together a pull request later today.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934
 ] 

Dongjoon Hyun edited comment on SPARK-25692 at 1/6/19 12:00 AM:


Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might 
be related.

- [master 
5829|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5829/testReport]
 (amp-jenkins-worker-05)
- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull]
 (amp-jenkins-worker-05)


was (Author: dongjoon):
Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might 
be related.

- - [master 
5829|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5829/testReport]
 (amp-jenkins-worker-05)
- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull]
 (amp-jenkins-worker-05)

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934
 ] 

Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 11:59 PM:


Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might 
be related.

- - [master 
5829|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5829/testReport]
 (amp-jenkins-worker-05)
- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull]
 (amp-jenkins-worker-05)


was (Author: dongjoon):
Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might 
be related.

- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull]
 (amp-jenkins-worker-05)

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26535.
---
Resolution: Won't Do

Hi, [~mgaido]. 

First of all, Hive now uses `Decimal` by default. Also, this would introduce 
TPC-DS/TPC-H query result differences between Spark versions. We cannot do this.

{code}
hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 0.089 seconds, Fetched: 1 row(s)
hive> explain select 2.3;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
TableScan
  alias: _dummy_table
  Row Limit Per Split: 1
  Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column 
stats: COMPLETE
  Select Operator
expressions: 2.3 (type: decimal(2,1))
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column 
stats: COMPLETE
ListSink
{code}

> Parsing literals as DOUBLE instead of DECIMAL
> -
>
> Key: SPARK-26535
> URL: https://issues.apache.org/jira/browse/SPARK-26535
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> As pointed out in [~dkbiswal]'s comment 
> https://github.com/apache/spark/pull/22450#issuecomment-423082389, most 
> other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default.
> Spark currently treats them as DECIMAL. This is quite problematic, 
> especially for operations on decimals, for which we base our 
> implementation on Hive/MSSQL.
> So this ticket proposes resolving literals as DOUBLE by default, 
> with a config that allows reverting to the previous behavior.
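
For a quick illustration of the current behavior this ticket proposes to change (schema output abridged):

{code:java}
// Today a fractional literal is resolved as DECIMAL, not DOUBLE.
spark.sql("SELECT 2.3 AS v").printSchema()
// v: decimal(2,1)

// Getting a DOUBLE currently requires an explicit cast (or the D suffix).
spark.sql("SELECT CAST(2.3 AS DOUBLE) AS v").printSchema()
// v: double
{code}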



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26402) Accessing nested fields with different cases in case insensitive mode

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735046#comment-16735046
 ] 

Dongjoon Hyun commented on SPARK-26402:
---

Hi, [~smilegator]. This is not a correctness issue because it previously failed 
with an AnalysisException.

> Accessing nested fields with different cases in case insensitive mode
> -
>
> Key: SPARK-26402
> URL: https://issues.apache.org/jira/browse/SPARK-26402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {{GetStructField}} with different optional names should be semantically 
> equal. We will use this as a building block to compare the nested fields used 
> in the plans to be optimized by the Catalyst optimizer.
> This PR also fixes the bug below: accessing nested fields with different 
> cases in case-insensitive mode results in an {{AnalysisException}}.
> {code:java}
> sql("create table t (s struct<i: Int>) using json")
> sql("select s.I from t group by s.i")
> {code}
> which currently fails with
> {code:java}
> org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither 
> present in the group by, nor is it an aggregate function
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26402) Accessing nested fields with different cases in case insensitive mode

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26402.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

This is resolved via https://github.com/apache/spark/pull/23353

> Accessing nested fields with different cases in case insensitive mode
> -
>
> Key: SPARK-26402
> URL: https://issues.apache.org/jira/browse/SPARK-26402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {{GetStructField}} with different optional names should be semantically 
> equal. We will use this as a building block to compare the nested fields used 
> in the plans to be optimized by the Catalyst optimizer.
> This PR also fixes the bug below: accessing nested fields with different 
> cases in case-insensitive mode results in an {{AnalysisException}}.
> {code:java}
> sql("create table t (s struct<i: Int>) using json")
> sql("select s.I from t group by s.i")
> {code}
> which currently fails with
> {code:java}
> org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither 
> present in the group by, nor is it an aggregate function
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26535.
-

> Parsing literals as DOUBLE instead of DECIMAL
> -
>
> Key: SPARK-26535
> URL: https://issues.apache.org/jira/browse/SPARK-26535
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> As pointed out in [~dkbiswal]'s comment 
> https://github.com/apache/spark/pull/22450#issuecomment-423082389, most 
> other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default.
> Spark currently treats them as DECIMAL. This is quite problematic, 
> especially for operations on decimals, for which we base our 
> implementation on Hive/MSSQL.
> So this ticket proposes resolving literals as DOUBLE by default, 
> with a config that allows reverting to the previous behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26550) New datasource for benchmarking

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735040#comment-16735040
 ] 

Dongjoon Hyun commented on SPARK-26550:
---

The PR title looks more intuitive to me.

> New datasource for benchmarking
> ---
>
> Key: SPARK-26550
> URL: https://issues.apache.org/jira/browse/SPARK-26550
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The purpose of the new datasource is to materialise a dataset without the 
> additional overhead of actions and of converting row values to other types. 
> This can be used in benchmarking, as well as in cases where a dataset needs to 
> be materialised for its side effects, as in caching.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735038#comment-16735038
 ] 

Dongjoon Hyun commented on SPARK-26537:
---

I guess we can skip `branch-1.6/2.0/2.1` because those branches are EOL and their 
Jenkins jobs have been stopped for a while.

> update the release scripts to point to gitbox
> -
>
> Key: SPARK-26537
> URL: https://issues.apache.org/jira/browse/SPARK-26537
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> we're seeing packaging build failures like this:  
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old 
> apache git repos:
>  
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git";
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git";
> pom.xml: 
> scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all 
> released versions.
> i'll put together a pull request later today.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26537) update the release scripts to point to gitbox

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26537:
--
Target Version/s: 2.2.3, 2.3.3, 2.4.1, 3.0.0

> update the release scripts to point to gitbox
> -
>
> Key: SPARK-26537
> URL: https://issues.apache.org/jira/browse/SPARK-26537
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> we're seeing packaging build failures like this:  
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old 
> apache git repos:
>  
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git";
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git";
> pom.xml: 
> scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all 
> released versions.
> i'll put together a pull request later today.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26537) update the release scripts to point to gitbox

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26537:
--
Affects Version/s: 2.2.0

> update the release scripts to point to gitbox
> -
>
> Key: SPARK-26537
> URL: https://issues.apache.org/jira/browse/SPARK-26537
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> we're seeing packaging build failures like this:  
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old 
> apache git repos:
>  
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git";
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git";
> pom.xml: 
> scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all 
> released versions.
> i'll put together a pull request later today.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment

2019-01-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26545.
-
   Resolution: Fixed
 Assignee: Kris Mok
Fix Version/s: 3.0.0
   2.4.1
   2.3.3
   2.2.3

> Fix typo in EqualNullSafe's truth table comment
> ---
>
> Key: SPARK-26545
> URL: https://issues.apache.org/jira/browse/SPARK-26545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Trivial
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
>
> The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results 
> as UNKNOWN.
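
For reference, the null-safe equality semantics that the corrected truth table describes (results shown as comments):

{code:java}
// <=> never returns NULL: comparisons involving NULL evaluate to TRUE or FALSE.
spark.sql("SELECT 1 <=> 1, 1 <=> NULL, NULL <=> NULL").show()
// 1 <=> 1       -> true
// 1 <=> NULL    -> false
// NULL <=> NULL -> true
{code}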



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26550) New datasource for benchmarking

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26550:


Assignee: (was: Apache Spark)

> New datasource for benchmarking
> ---
>
> Key: SPARK-26550
> URL: https://issues.apache.org/jira/browse/SPARK-26550
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The purpose of the new datasource is to materialise a dataset without the 
> additional overhead of actions and of converting row values to other types. 
> This can be used in benchmarking, as well as in cases where a dataset needs to 
> be materialised for its side effects, as in caching.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26550) New datasource for benchmarking

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26550:


Assignee: Apache Spark

> New datasource for benchmarking
> ---
>
> Key: SPARK-26550
> URL: https://issues.apache.org/jira/browse/SPARK-26550
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The purpose of the new datasource is to materialise a dataset without the 
> additional overhead of actions and of converting row values to other types. 
> This can be used in benchmarking, as well as in cases where a dataset needs to 
> be materialised for its side effects, as in caching.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26550) New datasource for benchmarking

2019-01-05 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26550:
--

 Summary: New datasource for benchmarking
 Key: SPARK-26550
 URL: https://issues.apache.org/jira/browse/SPARK-26550
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


The purpose of the new datasource is to materialise a dataset without the 
additional overhead of actions and of converting row values to other types. 
This can be used in benchmarking, as well as in cases where a dataset needs to be 
materialised for its side effects, as in caching.
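
For comparison, the usual ways to force materialisation today (a sketch only; the proposed datasource's API is not specified in this ticket) each add work beyond just scanning the data:

{code:java}
val df = spark.range(0, 1000000L).selectExpr("id", "id % 10 AS bucket")

// count() materialises the dataset but adds an aggregate on top of the scan.
df.count()

// collect() materialises it and also converts every value to external
// (driver-side) objects, which is exactly the kind of overhead described above.
df.collect()
{code}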



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26549) PySpark worker reuse take no effect for Python3

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26549:


Assignee: Apache Spark

> PySpark worker reuse take no effect for Python3
> ---
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for the 
> PySpark worker reuse scenario, we found that worker reuse has no effect 
> for Python3, while it works properly for Python2 and PyPy.
> It happens because, when the Python worker checks for the end of the stream in 
> Python3, it reads an unexpected value -1, which refers to 
> END_OF_DATA_SECTION. See the code in worker.py:
> {code:python}
> # check end of stream
> if read_int(infile) == SpecialLengths.END_OF_STREAM:
>     write_int(SpecialLengths.END_OF_STREAM, outfile)
> else:
>     # write a different value to tell JVM to not reuse this worker
>     write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
>     sys.exit(-1)
> {code}
> The code works well for Python2 and PyPy because END_OF_DATA_SECTION is 
> already handled while loading the iterator from the socket stream; see the code 
> in FramedSerializer:
> {code:python}
> def load_stream(self, stream):
>     while True:
>         try:
>             yield self._read_with_length(stream)
>         except EOFError:
>             return
>
> ...
>
> def _read_with_length(self, stream):
>     length = read_int(stream)
>     if length == SpecialLengths.END_OF_DATA_SECTION:
>         # END_OF_DATA_SECTION raises EOF here; it is caught in load_stream
>         raise EOFError
>     elif length == SpecialLengths.NULL:
>         return None
>     obj = stream.read(length)
>     if len(obj) < length:
>         raise EOFError
>     return self.loads(obj)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26549) PySpark worker reuse take no effect for Python3

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26549:


Assignee: (was: Apache Spark)

> PySpark worker reuse take no effect for Python3
> ---
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for the 
> PySpark worker reuse scenario, we found that worker reuse has no effect 
> for Python3, while it works properly for Python2 and PyPy.
> It happens because, when the Python worker checks for the end of the stream in 
> Python3, it reads an unexpected value -1, which refers to 
> END_OF_DATA_SECTION. See the code in worker.py:
> {code:python}
> # check end of stream
> if read_int(infile) == SpecialLengths.END_OF_STREAM:
>     write_int(SpecialLengths.END_OF_STREAM, outfile)
> else:
>     # write a different value to tell JVM to not reuse this worker
>     write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
>     sys.exit(-1)
> {code}
> The code works well for Python2 and PyPy because END_OF_DATA_SECTION is 
> already handled while loading the iterator from the socket stream; see the code 
> in FramedSerializer:
> {code:python}
> def load_stream(self, stream):
>     while True:
>         try:
>             yield self._read_with_length(stream)
>         except EOFError:
>             return
>
> ...
>
> def _read_with_length(self, stream):
>     length = read_int(stream)
>     if length == SpecialLengths.END_OF_DATA_SECTION:
>         # END_OF_DATA_SECTION raises EOF here; it is caught in load_stream
>         raise EOFError
>     elif length == SpecialLengths.NULL:
>         return None
>     obj = stream.read(length)
>     if len(obj) < length:
>         raise EOFError
>     return self.loads(obj)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-05 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734986#comment-16734986
 ] 

Pablo Langa Blanco commented on SPARK-26457:


Hi [~deshanxiao],

What configurations are you thinking about? Could you explain the cases where this 
information would be relevant? I'm thinking of the case where you are working 
with YARN: YARN already has all the Hadoop information we could need about 
the Spark job, so you don't need it duplicated in the History Server. 

Thanks!

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some Hadoop configurations in the 
> HistoryServer environment tab for debugging Hadoop-related bugs
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25917) Spark UI's executors page loads forever when memoryMetrics in None. Fix is to JSON ignore memorymetrics when it is None.

2019-01-05 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734984#comment-16734984
 ] 

Pablo Langa Blanco commented on SPARK-25917:


The pull request was closed because the problem has already been solved, so the 
issue should be closed too.

> Spark UI's executors page loads forever when memoryMetrics in None. Fix is to 
> JSON ignore memorymetrics when it is None.
> 
>
> Key: SPARK-25917
> URL: https://issues.apache.org/jira/browse/SPARK-25917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: Rong Tang
>Priority: Major
>
> Spark UI's executors page loads forever when memoryMetrics is None. The fix is to 
> JSON-ignore memoryMetrics when it is None.
> ## How was this patch tested?
> Before fix: (loads forever)
> ![image](https://user-images.githubusercontent.com/1785565/47875681-64dfe480-ddd4-11e8-8d15-5ed1457bc24f.png)
> After fix:
> ![image](https://user-images.githubusercontent.com/1785565/47875691-6b6e5c00-ddd4-11e8-9895-db8dd9730ee1.png)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for Python3

2019-01-05 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Description: 
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for Python3 while works properly for Python2 and PyPy.
It happened because, during the python worker check end of the stream in 
Python3, we got an unexpected value -1 here which refers to 
END_OF_DATA_SECTION. See the code in worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
    write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
    # write a different value to tell JVM to not reuse this worker
    write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
    sys.exit(-1)
{code}
The code works well for Python2 and PyPy cause the END_OF_DATA_SECTION has been 
handled during load iterator from the socket stream, see the code in 
FramedSerializer:

{code:python}
def load_stream(self, stream):
    while True:
        try:
            yield self._read_with_length(stream)
        except EOFError:
            return

...

def _read_with_length(self, stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        raise EOFError  # END_OF_DATA_SECTION raised EOF here and catched in load_stream
    elif length == SpecialLengths.NULL:
        return None
    obj = stream.read(length)
    if len(obj) < length:
        raise EOFError
    return self.loads(obj)
{code}



  was:
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for Python3 while works properly for Python2 and PyPy.
It happened because, during the python worker check end of the stream in 
Python3, we got an unexpected value -1 here which refers to 
END_OF_DATA_SECTION. See the code in worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
    write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
    # write a different value to tell JVM to not reuse this worker
    write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
    sys.exit(-1)
{code}
The code works well for Python2 and PyPy cause the END_OF_DATA_SECTION has been 
handled during load iterator from the socket stream, see the code in 
FramedSerializer:

{code:python}
def load_stream(self, stream):
    while True:
        try:
            yield self._read_with_length(stream)
        except EOFError:
            return
...

def _read_with_length(self, stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        raise EOFError  # END_OF_DATA_SECTION raised EOF here and catched in load_stream
    elif length == SpecialLengths.NULL:
        return None
    obj = stream.read(length)
    if len(obj) < length:
        raise EOFError
    return self.loads(obj)
{code}




> PySpark worker reuse take no effect for Python3
> ---
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for Python3 while works properly for Python2 and PyPy.
> It happened because, during the python worker check end of the stream in 
> Python3, we got an unexpected value -1 here which refers to 
> END_OF_DATA_SECTION. See the code in worker.py:
> {code:python}
> # check end of stream
> if read_int(infile) == SpecialLengths.END_OF_STREAM:
> write_int(SpecialLengths.END_OF_STREAM, outfile)
> else:
> # write a different value to tell JVM to not reuse this worker
> write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
> sys.exit(-1)
> {code}
> The code works well for Python2 and PyPy cause the END_OF_DATA_SECTION has 
> been handled during load iterator from the socket stream, see the code in 
> FramedSerializer:
> {code:python}
> def load_stream(self, stream):
> while True:
> try:
> yield self._read_with_length(stream)
> except EOFError:
> return
> ...
> def _read_with_length(self, stream):
> length = read_int(stream)
> if length == SpecialLengths.END_OF_DATA_SECTION:
> raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
> load_stream
> elif length == SpecialLengths.NULL:
> return None
> obj = stream.read(length)
> if le

[jira] [Updated] (SPARK-26549) PySpark worker reuse take no effect for Python3

2019-01-05 Thread Yuanjian Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-26549:

Description: 
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for the 
PySpark worker reuse scenario, we found that worker reuse has no effect 
for Python3, while it works properly for Python2 and PyPy.
It happens because, when the Python worker checks for the end of the stream in 
Python3, it reads an unexpected value -1, which refers to 
END_OF_DATA_SECTION. See the code in worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
    write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
    # write a different value to tell JVM to not reuse this worker
    write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
    sys.exit(-1)
{code}
The code works well for Python2 and PyPy because END_OF_DATA_SECTION is 
already handled while loading the iterator from the socket stream; see the code 
in FramedSerializer:

{code:python}
def load_stream(self, stream):
    while True:
        try:
            yield self._read_with_length(stream)
        except EOFError:
            return

...

def _read_with_length(self, stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        # END_OF_DATA_SECTION raises EOF here; it is caught in load_stream
        raise EOFError
    elif length == SpecialLengths.NULL:
        return None
    obj = stream.read(length)
    if len(obj) < length:
        raise EOFError
    return self.loads(obj)
{code}



  was:
During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
PySpark worker reuse scenario, we found that the worker reuse takes no effect 
for Python3 while works properly for Python2 and PyPy.
It happened because, during the python worker check end of the stream in 
Python3, we got an unexpected value -1 here which refers to 
END_OF_DATA_SECTION. See the code in worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
    write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
    # write a different value to tell JVM to not reuse this worker
    write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
    sys.exit(-1)
{code}
The code works well for Python2 and PyPy cause the END_OF_DATA_SECTION has been 
handled during load iterator from the socket stream, see the code in 
FramedSerializer:

{code:python}
def load_stream(self, stream):
    while True:
        try:
            yield self._read_with_length(stream)
        except EOFError:
            return

...

def _read_with_length(self, stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        raise EOFError  # END_OF_DATA_SECTION raised EOF here and catched in load_stream
    elif length == SpecialLengths.NULL:
        return None
    obj = stream.read(length)
    if len(obj) < length:
        raise EOFError
    return self.loads(obj)
{code}




> PySpark worker reuse take no effect for Python3
> ---
>
> Key: SPARK-26549
> URL: https://issues.apache.org/jira/browse/SPARK-26549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> During [the follow-up 
> work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for 
> PySpark worker reuse scenario, we found that the worker reuse takes no effect 
> for Python3 while works properly for Python2 and PyPy.
> It happened because, during the python worker check end of the stream in 
> Python3, we got an unexpected value -1 here which refers to 
> END_OF_DATA_SECTION. See the code in worker.py:
> {code:python}
> # check end of stream
> if read_int(infile) == SpecialLengths.END_OF_STREAM:
> write_int(SpecialLengths.END_OF_STREAM, outfile)
> else:
> # write a different value to tell JVM to not reuse this worker
> write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
> sys.exit(-1)
> {code}
> The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has 
> been handled during load iterator from the socket stream, see the code in 
> FramedSerializer:
> {code:python}
> def load_stream(self, stream):
> while True:
> try:
> yield self._read_with_length(stream)
> except EOFError:
> return
> ...
> def _read_with_length(self, stream):
> length = read_int(stream)
> if length == SpecialLengths.END_OF_DATA_SECTION:
> raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in 
> load_stream
> elif length == SpecialLengths.NULL:
> return None
> obj = stream.read(length)
> if len(obj) < length:
> raise EOFError
> return self.loa

[jira] [Created] (SPARK-26549) PySpark worker reuse take no effect for Python3

2019-01-05 Thread Yuanjian Li (JIRA)
Yuanjian Li created SPARK-26549:
---

 Summary: PySpark worker reuse take no effect for Python3
 Key: SPARK-26549
 URL: https://issues.apache.org/jira/browse/SPARK-26549
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Yuanjian Li


During [the follow-up 
work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for the 
PySpark worker reuse scenario, we found that worker reuse has no effect 
for Python3, while it works properly for Python2 and PyPy.
It happens because, when the Python worker checks for the end of the stream in 
Python3, it reads an unexpected value -1, which refers to 
END_OF_DATA_SECTION. See the code in worker.py:
{code:python}
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
    write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
    # write a different value to tell JVM to not reuse this worker
    write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
    sys.exit(-1)
{code}
The code works well for Python2 and PyPy because END_OF_DATA_SECTION is 
already handled while loading the iterator from the socket stream; see the code 
in FramedSerializer:

{code:python}
def load_stream(self, stream):
    while True:
        try:
            yield self._read_with_length(stream)
        except EOFError:
            return
...

def _read_with_length(self, stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        # END_OF_DATA_SECTION raises EOF here; it is caught in load_stream
        raise EOFError
    elif length == SpecialLengths.NULL:
        return None
    obj = stream.read(length)
    if len(obj) < length:
        raise EOFError
    return self.loads(obj)
{code}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26548) Don't block during query optimization

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26548:


Assignee: (was: Apache Spark)

> Don't block during query optimization
> -
>
> Key: SPARK-26548
> URL: https://issues.apache.org/jira/browse/SPARK-26548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Priority: Minor
>  Labels: sql
>
> In Spark 2.4.0 the CacheManager was updated so that it does not execute jobs 
> while it holds a lock. This was introduced in SPARK-23880.
> The CacheManager still holds a write lock during the execution of the query 
> optimizer. For complex queries the optimizer can run for a long time (we see 
> 10-15 minutes for some exceptionally large queries), so only one thread can 
> optimize at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26548) Don't block during query optimization

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26548:


Assignee: Apache Spark

> Don't block during query optimization
> -
>
> Key: SPARK-26548
> URL: https://issues.apache.org/jira/browse/SPARK-26548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Assignee: Apache Spark
>Priority: Minor
>  Labels: sql
>
> In Spark 2.4.0 the CacheManager was updated so that it does not execute jobs 
> while it holds a lock. This was introduced in SPARK-23880.
> The CacheManager still holds a write lock during the execution of the query 
> optimizer. For complex queries the optimizer can run for a long time (we see 
> 10-15 minutes for some exceptionally large queries), so only one thread can 
> optimize at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26535:


Assignee: (was: Apache Spark)

> Parsing literals as DOUBLE instead of DECIMAL
> -
>
> Key: SPARK-26535
> URL: https://issues.apache.org/jira/browse/SPARK-26535
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> As pointed out in [~dkbiswal]'s comment 
> https://github.com/apache/spark/pull/22450#issuecomment-423082389, most 
> other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default.
> Spark currently treats them as DECIMAL. This is quite problematic, 
> especially for operations on decimals, for which we base our 
> implementation on Hive/MSSQL.
> So this ticket proposes resolving literals as DOUBLE by default, 
> with a config that allows reverting to the previous behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26535:


Assignee: Apache Spark

> Parsing literals as DOUBLE instead of DECIMAL
> -
>
> Key: SPARK-26535
> URL: https://issues.apache.org/jira/browse/SPARK-26535
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> As pointed out in [~dkbiswal]'s comment 
> https://github.com/apache/spark/pull/22450#issuecomment-423082389, most 
> other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default.
> Spark currently treats them as DECIMAL. This is quite problematic, 
> especially for operations on decimals, for which we base our 
> implementation on Hive/MSSQL.
> So this ticket proposes resolving literals as DOUBLE by default, 
> with a config that allows reverting to the previous behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934
 ] 

Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 4:55 PM:
---

Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine 
might be related.

- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull]
 (amp-jenkins-worker-05)


was (Author: dongjoon):
Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine 
might be related.

- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)


> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934
 ] 

Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 4:53 PM:
---

Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine 
might be related.

- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)



was (Author: dongjoon):
Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently in Maven testing. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine 
might be related.

- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)


> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934
 ] 

Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 4:52 PM:
---

Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently in Maven testing. 

The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine 
might be related.

- [master 
5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
 (amp-jenkins-worker-05)
- [master 
5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
 (amp-jenkins-worker-05)
- [master 
5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull]
 (amp-jenkins-worker-05)

- [SparkPullRequestBuilder 
100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull]
 (amp-jenkins-worker-05)



was (Author: dongjoon):
Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently in Maven testing. 

The failure is always `fetchBothChunks`. Can we increase the timeout from 5 
seconds to 10 (or 20) seconds? Or would that hide the underlying real issue?

- 
[5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
- 
[5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
- 
[5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25692:


Assignee: Apache Spark

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25692:


Assignee: (was: Apache Spark)

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26547) Remove duplicate toHiveString from HiveUtils

2019-01-05 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26547:
--

 Summary: Remove duplicate toHiveString from HiveUtils
 Key: SPARK-26547
 URL: https://issues.apache.org/jira/browse/SPARK-26547
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The toHiveString method is already implemented in the HiveResult object. The 
method can be removed from HiveUtils.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26548) Don't block during query optimization

2019-01-05 Thread Dave DeCaprio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734935#comment-16734935
 ] 

Dave DeCaprio commented on SPARK-26548:
---

I have a fix and am creating a PR for this.

> Don't block during query optimization
> -
>
> Key: SPARK-26548
> URL: https://issues.apache.org/jira/browse/SPARK-26548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Priority: Minor
>  Labels: sql
>
> In Spark 2.4.0 the CacheManager was updated so it will not execute jobs while 
> it holds a lock. This was introduced in SPARK-23880.
> The CacheManager still holds a write lock during the execution of the query 
> optimizer. For complex queries the optimizer can run for a long time (we see 
> 10-15 minutes for some exceptionally large queries). This allows only one 
> thread to optimize at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25692:
--
Summary: Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks  (was: 
Flaky test: ChunkFetchIntegrationSuite)

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934
 ] 

Dongjoon Hyun commented on SPARK-25692:
---

Hi, [~zsxwing] and [~tgraves]. 

While looking at other failures, I noticed that this failure still happens 
frequently in Maven testing. 

The failure is always `fetchBothChunks`. Can we increase the timeout from 5 
seconds to 10 (or 20) seconds? Or would that hide the underlying real issue?

- 
[5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport]
- 
[5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport]
- 
[5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport]

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26548) Don't block during query optimization

2019-01-05 Thread Dave DeCaprio (JIRA)
Dave DeCaprio created SPARK-26548:
-

 Summary: Don't block during query optimization
 Key: SPARK-26548
 URL: https://issues.apache.org/jira/browse/SPARK-26548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dave DeCaprio


In Spark 2.4.0 the CacheManager was updated so it will not execute jobs while 
it holds a lock. This was introduced in SPARK-23880.

The CacheManager still holds a write lock during the execution of the query 
optimizer. For complex queries the optimizer can run for a long time (we see 
10-15 minutes for some exceptionally large queries). This allows only one 
thread to optimize at a time.
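A rough sketch of the proposed direction (illustrative only, not the actual CacheManager code): do the expensive planning outside the lock and take the write lock only to mutate the cached-plan list.

{code:scala}
import java.util.concurrent.locks.ReentrantReadWriteLock

// Illustrative cache skeleton: buildPlan (the expensive optimization) runs with no
// lock held, so many threads can optimize concurrently; the write lock only guards
// the brief list update, with a re-check to avoid duplicate entries.
class NonBlockingPlanCache[K, V] {
  private val lock = new ReentrantReadWriteLock()
  private var entries: List[(K, V)] = Nil

  def cacheIfAbsent(key: K)(buildPlan: => V): Unit = {
    val planned = buildPlan            // long-running work, outside any lock
    lock.writeLock().lock()
    try {
      if (!entries.exists(_._1 == key)) {
        entries = (key, planned) :: entries
      }
    } finally {
      lock.writeLock().unlock()
    }
  }
}
{code}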



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26547) Remove duplicate toHiveString from HiveUtils

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26547:


Assignee: Apache Spark

> Remove duplicate toHiveString from HiveUtils
> 
>
> Key: SPARK-26547
> URL: https://issues.apache.org/jira/browse/SPARK-26547
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The toHiveString method is already implemented in the HiveResult object. The 
> method can be removed from HiveUtils.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26547) Remove duplicate toHiveString from HiveUtils

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26547:


Assignee: (was: Apache Spark)

> Remove duplicate toHiveString from HiveUtils
> 
>
> Key: SPARK-26547
> URL: https://issues.apache.org/jira/browse/SPARK-26547
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The toHiveString method is already implemented in the HiveResult object. The 
> method can be removed from HiveUtils.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale

2019-01-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734932#comment-16734932
 ] 

Dongjoon Hyun commented on SPARK-26540:
---

[~mgaido] requested closing this because SPARK-26538 was created first and 
already has a PR. Please see that PR. I closed mine.

> Support PostgreSQL numeric arrays without precision/scale
> -
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and I think we need more logic to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  
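Until the dialect handles this, one untested workaround sketch is to push a cast down to PostgreSQL so the array elements arrive with an explicit precision/scale (or as text), reusing the names from the repro above:

{code:scala}
// Untested workaround sketch, not a fix in Spark itself: give the numeric[] elements
// an explicit precision/scale (or fall back to text[]) in the pushed-down query so
// Spark never has to infer a decimal(0,0) element type.
val pgTable = spark.read.jdbc(
  "jdbc:postgresql:postgres",
  "(SELECT v::numeric(38,18)[] AS v, d FROM t) AS t_cast",  // or v::text[] AS v
  options)
{code}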



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26540.
---
Resolution: Duplicate

> Support PostgreSQL numeric arrays without precision/scale
> -
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and I think we need more logic to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26280) Spark will read entire CSV file even when limit is used

2019-01-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26280.
--
Resolution: Duplicate

> Spark will read entire CSV file even when limit is used
> ---
>
> Key: SPARK-26280
> URL: https://issues.apache.org/jira/browse/SPARK-26280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Amir Bar-Or
>Priority: Major
>
> When you read a CSV as below, the parser still wastes time and reads the entire 
> file:
> var lineDF1 = spark.read
>  .format("com.databricks.spark.csv")
>  .option("header", "true") //reading the headers
>  .option("mode", "DROPMALFORMED")
>  .option("delimiter",",")
>  .option("inferSchema", "false")
>  .schema(line_schema)
>  .load(i_lineitem)
>  .limit(10)
>  
> Even though a LocalLimit is created, this does not stop the FileScan and 
> the parser from parsing the entire file. Is it possible to push the limit down 
> and stop the parsing?
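As a workaround sketch (not a change to Spark itself), one can limit the raw text lines first and hand only those to the CSV reader, so the parser never tokenizes the whole file; the names reuse the report's `line_schema` and `i_lineitem`:

{code:scala}
import org.apache.spark.sql.Dataset

// Read the file as plain text, keep only header + 10 data lines, then parse that
// small Dataset[String] as CSV (DataFrameReader.csv(Dataset[String]) exists since 2.2).
val rawLines: Dataset[String] = spark.read.textFile(i_lineitem).limit(11)
val lineDF1 = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .schema(line_schema)
  .csv(rawLines)
{code}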



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26336) left_anti join with Na Values

2019-01-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26336.
--
Resolution: Invalid

> left_anti join with Na Values
> -
>
> Key: SPARK-26336
> URL: https://issues.apache.org/jira/browse/SPARK-26336
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Carlos
>Priority: Major
>
> When I'm joining two dataframes whose data has NA values, the left_anti 
> join does not work correctly, because it does not detect records with NA values.
> Example:  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import *
> spark = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
> data = [(1,"Test"),(2,"Test"),(3,None)]
> df1 = spark.createDataFrame(data,("id","columndata"))
> df2 = spark.createDataFrame(data,("id","columndata"))
> df_joined = df1.join(df2, df1.columns,'left_anti'){code}
> df_joined has data, even though the two dataframes are the same.
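The underlying cause is SQL null semantics: NULL = NULL evaluates to NULL, so rows whose columndata is NULL never match and survive the left_anti join. A null-safe sketch (Scala API, assuming analogous df1/df2 DataFrames) would be:

{code:scala}
// Null-safe equality (<=>) treats NULL <=> NULL as true, so identical rows with
// NULLs are matched and removed by the left_anti join as expected.
val cond = df1("id") <=> df2("id") && df1("columndata") <=> df2("columndata")
val dfJoined = df1.join(df2, cond, "left_anti")   // empty when df1 and df2 are identical
{code}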



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26542) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26542.
-

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26542
> URL: https://issues.apache.org/jira/browse/SPARK-26542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
>
> For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the 
> ExchangeCoordinator is introduced to determine the number of post-shuffle 
> partitions. But under certain conditions the coordinator does not perform very 
> well: some tasks are always retained and they work with a Shuffle Read 
> Size / Records of 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
> this is unreasonable, as targetPostShuffleInputSize should not be set too 
> large.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26155) Spark SQL performance degradation after applying SPARK-21052 with Q19 of TPC-DS in 3TB scale

2019-01-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26155.
--
Resolution: Duplicate

> Spark SQL performance degradation after applying SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql, tpcds.result.xlsx
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark 2.3 when running TPC-DS on SKX 8180. Several queries show serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark 2.1 on 3TB data. We investigated 
> this problem and figured out that the root cause is the community patch SPARK-21052, 
> which adds metrics to the hash join process. The impacted code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
> . Q19 costs about 30 seconds without these two lines of code and 126 seconds 
> with them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26543:
-
Target Version/s:   (was: 2.3.0)

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the 
> ExchangeCoordinator is introduced to determine the number of post-shuffle 
> partitions. But under certain conditions the coordinator does not perform very 
> well: some tasks are always retained and they work with a Shuffle Read 
> Size / Records of 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
> this is unreasonable, as targetPostShuffleInputSize should not be set too 
> large. As follows:
> !image-2019-01-05-13-18-30-487.png!
> We could filter out the useless (0 B) partitions with the ExchangeCoordinator 
> automatically.
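For reference, the two settings discussed above (a sketch only; the size value is illustrative, and raising it merely trades many tiny post-shuffle partitions for fewer, larger tasks, which is the compromise the reporter wants to avoid):

{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Illustrative value only: merges more tiny partitions at the cost of larger tasks.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "128m")
{code}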



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734916#comment-16734916
 ] 

Hyukjin Kwon commented on SPARK-26543:
--

Also, Spark doesn't take patches but PRs. Please take a look at 
https://spark.apache.org/contributing.html

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the 
> ExchangeCoordinator is introduced to determine the number of post-shuffle 
> partitions. But under certain conditions the coordinator does not perform very 
> well: some tasks are always retained and they work with a Shuffle Read 
> Size / Records of 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
> this is unreasonable, as targetPostShuffleInputSize should not be set too 
> large. As follows:
> !image-2019-01-05-13-18-30-487.png!
> We could filter out the useless (0 B) partitions with the ExchangeCoordinator 
> automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26543:
-
Fix Version/s: (was: 2.3.0)

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the 
> ExchangeCoordinator is introduced to determine the number of post-shuffle 
> partitions. But under certain conditions the coordinator does not perform very 
> well: some tasks are always retained and they work with a Shuffle Read 
> Size / Records of 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
> this is unreasonable, as targetPostShuffleInputSize should not be set too 
> large. As follows:
> !image-2019-01-05-13-18-30-487.png!
> We could filter out the useless (0 B) partitions with the ExchangeCoordinator 
> automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734914#comment-16734914
 ] 

Hyukjin Kwon commented on SPARK-26543:
--

Please avoid setting the target version, which is usually reserved for committers, 
and the fix version, which is usually set when the issue is actually fixed.

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the 
> ExchangeCoordinator is introduced to determine the number of post-shuffle 
> partitions. But under certain conditions the coordinator does not perform very 
> well: some tasks are always retained and they work with a Shuffle Read 
> Size / Records of 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
> this is unreasonable, as targetPostShuffleInputSize should not be set too 
> large. As follows:
> !image-2019-01-05-13-18-30-487.png!
> We could filter out the useless (0 B) partitions with the ExchangeCoordinator 
> automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26542) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26542.
--
Resolution: Duplicate

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26542
> URL: https://issues.apache.org/jira/browse/SPARK-26542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
>
> For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the 
> ExchangeCoordinator is introduced to determine the number of post-shuffle 
> partitions. But under certain conditions the coordinator does not perform very 
> well: some tasks are always retained and they work with a Shuffle Read 
> Size / Records of 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, but 
> this is unreasonable, as targetPostShuffleInputSize should not be set too 
> large.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26383) NPE when use DataFrameReader.jdbc with wrong URL

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26383:


Assignee: Apache Spark

> NPE when use DataFrameReader.jdbc with wrong URL
> 
>
> Key: SPARK-26383
> URL: https://issues.apache.org/jira/browse/SPARK-26383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: clouds
>Assignee: Apache Spark
>Priority: Minor
>
> When passing a wrong URL to jdbc:
> {code:java}
> val opts = Map(
>   "url" -> "jdbc:mysql://localhost/db",
>   "dbtable" -> "table",
>   "driver" -> "org.postgresql.Driver"
> )
> var df = spark.read.format("jdbc").options(opts).load
> {code}
> It throws an NPE instead of reporting a connection failure. (Note that the 
> URL and driver do not match here.)
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> {code}
> As the [postgresql jdbc driver 
> document|https://jdbc.postgresql.org/development/privateapi/org/postgresql/Driver.html#connect-java.lang.String-java.util.Properties-]
>  says, the driver should return "null" if it realizes it is the wrong kind 
> of driver to connect to the given URL, while 
> [ConnectionFactory|https://github.com/apache/spark/blob/e743e848484bf7d97e1b4f33ea83f8520ae7da04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L56]
>  does not check whether conn is null.
> {code:java}
> val conn: Connection = JdbcUtils.createConnectionFactory(options)()
> {code}
>  and tries to close the conn anyway:
> {code:java}
> try {
>   ...
> } finally {
>   conn.close()
> }
> {code}
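An illustrative guard (only a sketch of the reporter's suggestion, not the actual Spark code) that turns the null connection into a clear error:

{code:scala}
import java.sql.{Connection, Driver}
import java.util.Properties

// Per the JDBC contract, Driver.connect returns null when the URL is not meant for
// that driver, so checking the result gives a clear error instead of a later NPE.
def connectOrFail(driver: Driver, url: String, props: Properties): Connection = {
  val conn = driver.connect(url, props)
  require(conn != null,
    s"Driver ${driver.getClass.getName} does not accept the JDBC URL: $url")
  conn
}
{code}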



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26383) NPE when use DataFrameReader.jdbc with wrong URL

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26383:


Assignee: (was: Apache Spark)

> NPE when use DataFrameReader.jdbc with wrong URL
> 
>
> Key: SPARK-26383
> URL: https://issues.apache.org/jira/browse/SPARK-26383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: clouds
>Priority: Minor
>
> When passing a wrong URL to jdbc:
> {code:java}
> val opts = Map(
>   "url" -> "jdbc:mysql://localhost/db",
>   "dbtable" -> "table",
>   "driver" -> "org.postgresql.Driver"
> )
> var df = spark.read.format("jdbc").options(opts).load
> {code}
> It throws an NPE instead of reporting a connection failure. (Note that the 
> URL and driver do not match here.)
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> {code}
> As the [postgresql jdbc driver 
> document|https://jdbc.postgresql.org/development/privateapi/org/postgresql/Driver.html#connect-java.lang.String-java.util.Properties-]
>  says, the driver should return "null" if it realizes it is the wrong kind 
> of driver to connect to the given URL, while 
> [ConnectionFactory|https://github.com/apache/spark/blob/e743e848484bf7d97e1b4f33ea83f8520ae7da04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L56]
>  does not check whether conn is null.
> {code:java}
> val conn: Connection = JdbcUtils.createConnectionFactory(options)()
> {code}
>  and tries to close the conn anyway:
> {code:java}
> try {
>   ...
> } finally {
>   conn.close()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26078:
--
Fix Version/s: 2.3.3

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, but instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there, as it does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale

2019-01-05 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734887#comment-16734887
 ] 

Takeshi Yamamuro commented on SPARK-26540:
--

We need to close SPARK-26538 as a duplicate when resolving this.

> Support PostgreSQL numeric arrays without precision/scale
> -
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this;
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code and I think we need more logic to handle 
> numeric arrays;
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26546) Caching of DateTimeFormatter

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26546:


Assignee: Apache Spark

> Caching of DateTimeFormatter
> 
>
> Key: SPARK-26546
> URL: https://issues.apache.org/jira/browse/SPARK-26546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, instances of java.time.format.DateTimeFormatter are built each 
> time a new instance of Iso8601DateFormatter or Iso8601TimestampFormatter 
> is created, which is a time-consuming operation because it has to parse the 
> timestamp/date patterns. It could be useful to create a cache with key = 
> (pattern, locale) and value = an instance of java.time.format.DateTimeFormatter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26546) Caching of DateTimeFormatter

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26546:


Assignee: (was: Apache Spark)

> Caching of DateTimeFormatter
> 
>
> Key: SPARK-26546
> URL: https://issues.apache.org/jira/browse/SPARK-26546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, instances of java.time.format.DateTimeFormatter are built each 
> time a new instance of Iso8601DateFormatter or Iso8601TimestampFormatter 
> is created, which is a time-consuming operation because it has to parse the 
> timestamp/date patterns. It could be useful to create a cache with key = 
> (pattern, locale) and value = an instance of java.time.format.DateTimeFormatter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26546) Caching of DateTimeFormatter

2019-01-05 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26546:
--

 Summary: Caching of DateTimeFormatter
 Key: SPARK-26546
 URL: https://issues.apache.org/jira/browse/SPARK-26546
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Currently, instances of java.time.format.DateTimeFormatter are built each time 
a new instance of Iso8601DateFormatter or Iso8601TimestampFormatter is 
created, which is a time-consuming operation because it has to parse the 
timestamp/date patterns. It could be useful to create a cache with key = 
(pattern, locale) and value = an instance of java.time.format.DateTimeFormatter.
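A rough sketch of such a cache (illustrative only; the object name is made up, not the actual Spark code). DateTimeFormatter is immutable and thread-safe, so one instance per (pattern, locale) can be shared by every formatter:

{code:scala}
import java.time.format.DateTimeFormatter
import java.util.Locale
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

object FormatterCache {
  private val cache = new ConcurrentHashMap[(String, Locale), DateTimeFormatter]()

  // Build the formatter at most once per (pattern, locale) and reuse it afterwards.
  def get(pattern: String, locale: Locale): DateTimeFormatter =
    cache.computeIfAbsent((pattern, locale), new JFunction[(String, Locale), DateTimeFormatter] {
      override def apply(key: (String, Locale)): DateTimeFormatter =
        DateTimeFormatter.ofPattern(key._1, key._2)
    })
}
{code}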



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26545:


Assignee: Apache Spark

> Fix typo in EqualNullSafe's truth table comment
> ---
>
> Key: SPARK-26545
> URL: https://issues.apache.org/jira/browse/SPARK-26545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Trivial
>
> The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results 
> as UNKNOWN



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26545:


Assignee: (was: Apache Spark)

> Fix typo in EqualNullSafe's truth table comment
> ---
>
> Key: SPARK-26545
> URL: https://issues.apache.org/jira/browse/SPARK-26545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Priority: Trivial
>
> The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results 
> as UNKNOWN



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment

2019-01-05 Thread Kris Mok (JIRA)
Kris Mok created SPARK-26545:


 Summary: Fix typo in EqualNullSafe's truth table comment
 Key: SPARK-26545
 URL: https://issues.apache.org/jira/browse/SPARK-26545
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


The truth table comment in {{EqualNullSafe}} incorrectly marks FALSE results 
as UNKNOWN.
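
For reference, the intended semantics of null-safe equality can be checked directly in a Spark shell; the expected results in the comments are what the truth table comment should reflect (an illustrative check only, not the comment being fixed):

{code:java}
// Null-safe equality (<=>) never yields UNKNOWN/null, unlike plain '=':
spark.sql("SELECT 1 <=> 1, 1 <=> NULL, NULL <=> NULL, 1 = NULL").show()
// expected results: true, false, true, null
{code}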



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26540:
--
Summary: Support PostgreSQL numeric arrays without precision/scale  (was: 
Requirement failed when reading numeric arrays from PostgreSQL)

> Support PostgreSQL numeric arrays without precision/scale
> -
>
> Key: SPARK-26540
> URL: https://issues.apache.org/jira/browse/SPARK-26540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This bug was reported in spark-user: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this:
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d  numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>   v  |d 
> -+--
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
> Table "public.t"
>  Column |   Type| Modifiers 
> +---+---
>  v  | numeric[] | 
>  d  | numeric   | 
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  ||-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 
> exceeds max precision 0
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
> ...
> {code}
> I looked over the related code, and I think we need more logic to handle 
> numeric arrays; see 
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
>  
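
As a user-side stop-gap until the dialect handles this, a custom JdbcDialect registered before the read could map the element type of such arrays to Spark's default decimal type. A rough sketch, under the assumption that the PostgreSQL JDBC driver reports the type name of a numeric array column as "_numeric"; this is a hypothetical workaround, not the proposed fix:

{code:java}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Hypothetical workaround dialect: treat numeric[] columns whose precision and
// scale are unspecified as arrays of Spark's default decimal type, decimal(38,18).
object NumericArrayDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.ARRAY && typeName == "_numeric") {
      Some(ArrayType(DecimalType(38, 18)))
    } else {
      None
    }
  }
}

// Register before spark.read.jdbc(...) so it takes effect for the PostgreSQL URL.
JdbcDialects.registerDialect(NumericArrayDialect)
{code}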



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`

2019-01-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26541.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23459

> Add `-Pdocker-integration-tests` to `dev/scalastyle`
> 
>
> Key: SPARK-26541
> URL: https://issues.apache.org/jira/browse/SPARK-26541
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> This issue makes `scalastyle` check the `docker-integration-tests` module and 
> fixes one existing style error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26373) Spark UI 'environment' tab - column to indicate default vs overridden values

2019-01-05 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734814#comment-16734814
 ] 

Pablo Langa Blanco commented on SPARK-26373:


Hi [~toopt4],

Could you explain what the utility of this would be?

I'm thinking about it. When you start a Spark application you already know 
which properties you have set (through spark-defaults.conf, SparkConf, or the 
command line), and all the properties that you don't set are documented, 
together with their default values, at

https://spark.apache.org/docs/latest/configuration.html

Thanks for the proposal!
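
For what it's worth, the explicitly configured properties can already be listed programmatically; a small illustration (the property name below is just an example):

{code:java}
// SparkConf only contains properties that were explicitly set via
// spark-defaults.conf, SparkConf, or --conf; everything else runs on the
// documented default.
val explicitlySet = spark.sparkContext.getConf.getAll.toMap
println(explicitlySet.getOrElse("spark.executor.memory", "<not set, documented default: 1g>"))
{code}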

> Spark UI 'environment' tab - column to indicate default vs overridden values
> 
>
> Key: SPARK-26373
> URL: https://issues.apache.org/jira/browse/SPARK-26373
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: t oo
>Priority: Major
>
> Rather than just showing the name and value for each property, a new column 
> would also show whether the value is the default (show 'AS PER DEFAULT') or, 
> if it is overridden, show the actual default value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread chenliang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734810#comment-16734810
 ] 

chenliang edited comment on SPARK-26543 at 1/5/19 8:30 AM:
---

[~r...@databricks.com][~markhamstra][~cloud_fan] Could you please help to look 
at this, thank you!


was (Author: southernriver):
[~r...@databricks.com][~markhamstra][~cloud_fan] Could you please help look at 
this, thank you!

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when adaptive execution is enabled via 'set 
> spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
> determine the number of post-shuffle partitions. But under certain 
> conditions the coordinator does not perform very well: there are always 
> some tasks retained that end up with a Shuffle Read Size / Records of 
> 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, 
> but that is unreasonable because targetPostShuffleInputSize should not be 
> set too large. As shown below:
> !image-2019-01-05-13-18-30-487.png!
> We can filter out the useless (0 B) partitions in the ExchangeCoordinator 
> automatically.
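
For reference, the two settings mentioned above can be combined as follows on the 2.x line; the 64MB value is only an example of the workaround, not a recommendation:

{code:java}
// Enable adaptive execution and raise the post-shuffle target input size
// (workaround only; a larger target hides the empty partitions rather than
// removing them).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "64MB")
{code}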



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread chenliang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734810#comment-16734810
 ] 

chenliang commented on SPARK-26543:
---

[~r...@databricks.com][~markhamstra][~cloud_fan] Could you please help look at 
this, thank you!

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when adaptive execution is enabled via 'set 
> spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
> determine the number of post-shuffle partitions. But under certain 
> conditions the coordinator does not perform very well: there are always 
> some tasks retained that end up with a Shuffle Read Size / Records of 
> 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, 
> but that is unreasonable because targetPostShuffleInputSize should not be 
> set too large. As shown below:
> !image-2019-01-05-13-18-30-487.png!
> We can filter out the useless (0 B) partitions in the ExchangeCoordinator 
> automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

2019-01-05 Thread chenliang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated SPARK-26543:
--
Attachment: SPARK-26543.patch

> Support the coordinator to determine post-shuffle partitions more reasonably
> -
>
> Key: SPARK-26543
> URL: https://issues.apache.org/jira/browse/SPARK-26543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: chenliang
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png
>
>
> For Spark SQL, when adaptive execution is enabled via 'set 
> spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to 
> determine the number of post-shuffle partitions. But under certain 
> conditions the coordinator does not perform very well: there are always 
> some tasks retained that end up with a Shuffle Read Size / Records of 
> 0.0B/0. We could increase 
> spark.sql.adaptive.shuffle.targetPostShuffleInputSize to work around this, 
> but that is unreasonable because targetPostShuffleInputSize should not be 
> set too large. As shown below:
> !image-2019-01-05-13-18-30-487.png!
> We can filter out the useless (0 B) partitions in the ExchangeCoordinator 
> automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26544) Escape string when serializing map/array to make it valid JSON (keep alignment with Hive)

2019-01-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734809#comment-16734809
 ] 

Apache Spark commented on SPARK-26544:
--

User 'WangGuangxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23460

> Escape string when serializing map/array to make it valid JSON (keep 
> alignment with Hive)
> -
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Major
>
> When reading a Hive table with a map/array-typed column, the string 
> serialized by the Spark Thrift Server is not valid JSON, while Hive's is.
> For example, when selecting a field whose type is map, the Spark Thrift 
> Server returns
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while the Hive Thrift Server returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  
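
A minimal illustration of the expected escaping in plain Scala (not the actual Thrift Server code path):

{code:java}
// The inner value is itself a JSON string, so its backslashes and quotes must
// be escaped before it is embedded in the outer JSON document.
val inner = """{"impr_id":"20181231"}"""
val escaped = inner.replace("\\", "\\\\").replace("\"", "\\\"")
val row = s"""{"author_id":"123","log_pb":"$escaped","request_id":"001"}"""
// row == {"author_id":"123","log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
{code}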



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26544) Escape string when serializing map/array to make it valid JSON (keep alignment with Hive)

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26544:


Assignee: Apache Spark

> Escape string when serializing map/array to make it valid JSON (keep 
> alignment with Hive)
> -
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Major
>
> When reading a Hive table with a map/array-typed column, the string 
> serialized by the Spark Thrift Server is not valid JSON, while Hive's is.
> For example, when selecting a field whose type is map, the Spark Thrift 
> Server returns
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while the Hive Thrift Server returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26544) Escape string when serializing map/array to make it valid JSON (keep alignment with Hive)

2019-01-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26544:


Assignee: (was: Apache Spark)

> Escape string when serializing map/array to make it valid JSON (keep 
> alignment with Hive)
> -
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Major
>
> When reading a Hive table with a map/array-typed column, the string 
> serialized by the Spark Thrift Server is not valid JSON, while Hive's is.
> For example, when selecting a field whose type is map, the Spark Thrift 
> Server returns
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while the Hive Thrift Server returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26544) Escape string when serializing map/array to make it valid JSON (keep alignment with Hive)

2019-01-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734807#comment-16734807
 ] 

Apache Spark commented on SPARK-26544:
--

User 'WangGuangxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23460

> Escape string when serializing map/array to make it valid JSON (keep 
> alignment with Hive)
> -
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Major
>
> When reading a Hive table with a map/array-typed column, the string 
> serialized by the Spark Thrift Server is not valid JSON, while Hive's is.
> For example, when selecting a field whose type is map, the Spark Thrift 
> Server returns
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while the Hive Thrift Server returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org