[jira] [Assigned] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error
[ https://issues.apache.org/jira/browse/SPARK-26551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26551:
------------------------------------

    Assignee: (was: Apache Spark)

> Selecting one complex field and having is null predicate on another complex field can cause error
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26551
>                 URL: https://issues.apache.org/jira/browse/SPARK-26551
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Liang-Chi Hsieh
>            Priority: Major
>
> The query below can cause an error during schema pruning:
> {code:java}
> val query = sql("select * from contacts")
>   .where("name.middle is not null")
>   .select(
>     "id",
>     "name.first",
>     "name.middle",
>     "name.last"
>   )
>   .where("last = 'Jones'")
>   .select(count("id"))
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error
[ https://issues.apache.org/jira/browse/SPARK-26551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26551:
------------------------------------

    Assignee: Apache Spark

> Selecting one complex field and having is null predicate on another complex field can cause error
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26551
>                 URL: https://issues.apache.org/jira/browse/SPARK-26551
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Apache Spark
>            Priority: Major
>
> The query below can cause an error during schema pruning:
> {code:java}
> val query = sql("select * from contacts")
>   .where("name.middle is not null")
>   .select(
>     "id",
>     "name.first",
>     "name.middle",
>     "name.last"
>   )
>   .where("last = 'Jones'")
>   .select(count("id"))
> {code}
[jira] [Created] (SPARK-26551) Selecting one complex field and having is null predicate on another complex field can cause error
Liang-Chi Hsieh created SPARK-26551:
---------------------------------------

             Summary: Selecting one complex field and having is null predicate on another complex field can cause error
                 Key: SPARK-26551
                 URL: https://issues.apache.org/jira/browse/SPARK-26551
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Liang-Chi Hsieh

The query below can cause an error during schema pruning:

{code:java}
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
{code}
[jira] [Resolved] (SPARK-26548) Don't block during query optimization
[ https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-26548.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

> Don't block during query optimization
> -------------------------------------
>
>                 Key: SPARK-26548
>                 URL: https://issues.apache.org/jira/browse/SPARK-26548
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dave DeCaprio
>            Assignee: Dave DeCaprio
>            Priority: Minor
>              Labels: sql
>             Fix For: 3.0.0
>
> In Spark 2.4.0 the CacheManager was updated so that it will not execute jobs while it holds a lock. This was introduced in -SPARK-23880-.
> The CacheManager still holds a write lock during execution of the query optimizer. For complex queries the optimizer can run for a long time (we see 10-15 minutes for some exceptionally large queries), so only one thread can optimize at a time.
[jira] [Assigned] (SPARK-26548) Don't block during query optimization
[ https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-26548:
-------------------------------

    Assignee: Dave DeCaprio

> Don't block during query optimization
> -------------------------------------
>
>                 Key: SPARK-26548
>                 URL: https://issues.apache.org/jira/browse/SPARK-26548
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dave DeCaprio
>            Assignee: Dave DeCaprio
>            Priority: Minor
>              Labels: sql
>
> In Spark 2.4.0 the CacheManager was updated so that it will not execute jobs while it holds a lock. This was introduced in -SPARK-23880-.
> The CacheManager still holds a write lock during execution of the query optimizer. For complex queries the optimizer can run for a long time (we see 10-15 minutes for some exceptionally large queries), so only one thread can optimize at a time.
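The fix pattern described in this ticket can be sketched in a few lines. This is a hypothetical toy, not Spark's actual CacheManager: the expensive step (query optimization) runs *outside* the lock, and the lock guards only the short step that publishes the result. The names `CacheManager`, `cache_query`, and `optimize` are illustrative.

```python
import threading

class CacheManager:
    """Toy sketch: do expensive work outside the lock, publish under it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def cache_query(self, key, optimize):
        # Expensive work (standing in for query optimization) runs without
        # holding the lock, so many threads can optimize concurrently.
        plan = optimize(key)
        with self._lock:
            # Short critical section: only register the result.
            # setdefault keeps the first published plan if two threads raced.
            self._cache.setdefault(key, plan)
            return self._cache[key]
```

The trade-off of this pattern: two threads may optimize the same query concurrently and one result is discarded, which is usually preferable to serializing all optimization behind a single lock.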
[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735060#comment-16735060 ]

Dongjoon Hyun commented on SPARK-26537:
---------------------------------------

This is resolved via
- https://github.com/apache/spark/pull/23454
- https://github.com/apache/spark/pull/23472
- https://github.com/apache/spark/pull/23473

> update the release scripts to point to gitbox
> ---------------------------------------------
>
>                 Key: SPARK-26537
>                 URL: https://issues.apache.org/jira/browse/SPARK-26537
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Major
>             Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
> we're seeing packaging build failures like this:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old apache git repos:
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git"
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all released versions.
> i'll put together a pull request later today.
[jira] [Resolved] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-26537.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
                   2.4.1
                   2.3.3
                   2.2.3

> update the release scripts to point to gitbox
> ---------------------------------------------
>
>                 Key: SPARK-26537
>                 URL: https://issues.apache.org/jira/browse/SPARK-26537
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Major
>             Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
> we're seeing packaging build failures like this:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old apache git repos:
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git"
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all released versions.
> i'll put together a pull request later today.
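The fix is a mechanical host rewrite; a minimal sketch of it, assuming the replacement host is `gitbox.apache.org` (the ASF's successor to git-wip-us) and using a sample line from the grep output in the ticket:

```python
import re

# Old URL as reported by `grep -r git-wip` in dev/create-release/release-util.sh.
line = 'ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git"'

# Swap the retired git-wip-us host for gitbox, keeping the repo path intact.
fixed = re.sub(r"git-wip-us\.apache\.org", "gitbox.apache.org", line)

print(fixed)  # ASF_REPO="https://gitbox.apache.org/repos/asf/spark.git"
```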
[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934 ]

Dongjoon Hyun edited comment on SPARK-25692 at 1/6/19 12:00 AM:
----------------------------------------------------------------

Hi, [~zsxwing] and [~tgraves]. While looking into other failures, I noticed that this failure still happens frequently. The failing test is always `fetchBothChunks`, and the `amp-jenkins-worker-05` machine might be related.
- [master 5829|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5829/testReport] (amp-jenkins-worker-05)
- [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05)
- [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05)
- [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull] (amp-jenkins-worker-05)

was (Author: dongjoon):
Hi, [~zsxwing] and [~tgraves]. While looking into other failures, I noticed that this failure still happens frequently. The failing test is always `fetchBothChunks`, and the `amp-jenkins-worker-05` machine might be related.
- [master 5829|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5829/testReport] (amp-jenkins-worker-05)
- [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05)
- [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05)
- [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull] (amp-jenkins-worker-05)

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> ------------------------------------------------------
>
>                 Key: SPARK-25692
>                 URL: https://issues.apache.org/jira/browse/SPARK-25692
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Shixiong Zhu
>            Priority: Blocker
>         Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 2018-11-01 at 10.17.16 AM.png
>
> Looks like the whole test suite is pretty flaky. See:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.
[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934 ]

Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 11:59 PM:
----------------------------------------------------------------

Hi, [~zsxwing] and [~tgraves]. While looking into other failures, I noticed that this failure still happens frequently. The failing test is always `fetchBothChunks`, and the `amp-jenkins-worker-05` machine might be related.
- [master 5829|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5829/testReport] (amp-jenkins-worker-05)
- [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05)
- [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05)
- [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull] (amp-jenkins-worker-05)

was (Author: dongjoon):
Hi, [~zsxwing] and [~tgraves]. While looking into other failures, I noticed that this failure still happens frequently. The failing test is always `fetchBothChunks`, and the `amp-jenkins-worker-05` machine might be related.
- [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05)
- [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05)
- [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull] (amp-jenkins-worker-05)
- [SparkPullRequestBuilder 100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull] (amp-jenkins-worker-05)

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> ------------------------------------------------------
>
>                 Key: SPARK-25692
>                 URL: https://issues.apache.org/jira/browse/SPARK-25692
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Shixiong Zhu
>            Priority: Blocker
>         Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 2018-11-01 at 10.17.16 AM.png
>
> Looks like the whole test suite is pretty flaky. See:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.
[jira] [Resolved] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL
[ https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-26535.
-----------------------------------
    Resolution: Won't Do

Hi, [~mgaido]. First of all, Hive has started to use `Decimal` by default. Also, this would introduce TPC-DS/TPC-H query result differences between Spark versions. We cannot do this.

{code}
hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 0.089 seconds, Fetched: 1 row(s)
hive> explain select 2.3;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: _dummy_table
          Row Limit Per Split: 1
          Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
          Select Operator
            expressions: 2.3 (type: decimal(2,1))
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
            ListSink
{code}

> Parsing literals as DOUBLE instead of DECIMAL
> ---------------------------------------------
>
>                 Key: SPARK-26535
>                 URL: https://issues.apache.org/jira/browse/SPARK-26535
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Marco Gaido
>            Priority: Major
>
> As pointed out in [~dkbiswal]'s comment https://github.com/apache/spark/pull/22450#issuecomment-423082389, most other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default.
> Spark currently treats them as DECIMAL. This is quite problematic, especially for operations on decimals, where we base our implementation on Hive/MSSQL.
> So this ticket proposes resolving literals as DOUBLE by default, with a config that allows getting back the previous behavior.
[jira] [Commented] (SPARK-26402) Accessing nested fields with different cases in case insensitive mode
[ https://issues.apache.org/jira/browse/SPARK-26402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735046#comment-16735046 ]

Dongjoon Hyun commented on SPARK-26402:
---------------------------------------

Hi, [~smilegator]. This is not a correctness issue because it previously failed with AnalysisException.

> Accessing nested fields with different cases in case insensitive mode
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26402
>                 URL: https://issues.apache.org/jira/browse/SPARK-26402
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>             Fix For: 2.4.1, 3.0.0
>
> {{GetStructField}} with different optional names should be semantically equal. We will use this as a building block to compare the nested fields used in the plans to be optimized by the catalyst optimizer.
> This PR also fixes a bug where accessing nested fields with different cases in case-insensitive mode results in an {{AnalysisException}}:
> {code:java}
> sql("create table t (s struct<i: Int>) using json")
> sql("select s.I from t group by s.i")
> {code}
> which currently fails with
> {code:java}
> org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function
> {code}
[jira] [Resolved] (SPARK-26402) Accessing nested fields with different cases in case insensitive mode
[ https://issues.apache.org/jira/browse/SPARK-26402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-26402.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
                   2.4.1

This is resolved via https://github.com/apache/spark/pull/23353

> Accessing nested fields with different cases in case insensitive mode
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26402
>                 URL: https://issues.apache.org/jira/browse/SPARK-26402
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>             Fix For: 2.4.1, 3.0.0
>
> {{GetStructField}} with different optional names should be semantically equal. We will use this as a building block to compare the nested fields used in the plans to be optimized by the catalyst optimizer.
> This PR also fixes a bug where accessing nested fields with different cases in case-insensitive mode results in an {{AnalysisException}}:
> {code:java}
> sql("create table t (s struct<i: Int>) using json")
> sql("select s.I from t group by s.i")
> {code}
> which currently fails with
> {code:java}
> org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function
> {code}
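The resolution rule behind the fix can be sketched as a toy lookup. This is an illustrative Python sketch, not Spark's actual {{GetStructField}} logic; `resolve_field` is a hypothetical helper:

```python
def resolve_field(fields, name, case_sensitive=False):
    """Toy resolver: look up a struct field by name, optionally ignoring case."""
    for field in fields:
        if field == name or (not case_sensitive and field.lower() == name.lower()):
            # Return the *declared* name, so "s.I" and "s.i" resolve to the
            # same field and compare as semantically equal downstream.
            return field
    raise KeyError(name)

print(resolve_field(["i"], "I"))  # i
```

The key point mirrored from the ticket: in case-insensitive mode both `s.I` and `s.i` must resolve to the one declared field `i`, so expressions built from either spelling can be recognized as equal by the optimizer.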
[jira] [Closed] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL
[ https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed SPARK-26535.
---------------------------------

> Parsing literals as DOUBLE instead of DECIMAL
> ---------------------------------------------
>
>                 Key: SPARK-26535
>                 URL: https://issues.apache.org/jira/browse/SPARK-26535
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Marco Gaido
>            Priority: Major
>
> As pointed out in [~dkbiswal]'s comment https://github.com/apache/spark/pull/22450#issuecomment-423082389, most other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default.
> Spark currently treats them as DECIMAL. This is quite problematic, especially for operations on decimals, where we base our implementation on Hive/MSSQL.
> So this ticket proposes resolving literals as DOUBLE by default, with a config that allows getting back the previous behavior.
[jira] [Commented] (SPARK-26550) New datasource for benchmarking
[ https://issues.apache.org/jira/browse/SPARK-26550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735040#comment-16735040 ]

Dongjoon Hyun commented on SPARK-26550:
---------------------------------------

The PR title looks more intuitive to me.

> New datasource for benchmarking
> -------------------------------
>
>                 Key: SPARK-26550
>                 URL: https://issues.apache.org/jira/browse/SPARK-26550
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> The purpose of the new datasource is materialisation of a dataset without the additional overhead of actions and of converting rows' values to other types. This can be used in benchmarking, as well as in cases where a dataset needs to be materialised for its side effects, as in caching.
[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735038#comment-16735038 ]

Dongjoon Hyun commented on SPARK-26537:
---------------------------------------

I guess we can skip `branch-1.6/2.0/2.1` because those branches are EOL and their Jenkins jobs have been stopped for a while.

> update the release scripts to point to gitbox
> ---------------------------------------------
>
>                 Key: SPARK-26537
>                 URL: https://issues.apache.org/jira/browse/SPARK-26537
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Major
>
> we're seeing packaging build failures like this:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old apache git repos:
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git"
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all released versions.
> i'll put together a pull request later today.
[jira] [Updated] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-26537:
----------------------------------
    Target Version/s: 2.2.3, 2.3.3, 2.4.1, 3.0.0

> update the release scripts to point to gitbox
> ---------------------------------------------
>
>                 Key: SPARK-26537
>                 URL: https://issues.apache.org/jira/browse/SPARK-26537
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Major
>
> we're seeing packaging build failures like this:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old apache git repos:
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git"
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all released versions.
> i'll put together a pull request later today.
[jira] [Updated] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-26537:
----------------------------------
    Affects Version/s: 2.2.0

> update the release scripts to point to gitbox
> ---------------------------------------------
>
>                 Key: SPARK-26537
>                 URL: https://issues.apache.org/jira/browse/SPARK-26537
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Major
>
> we're seeing packaging build failures like this:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console]
> i did a quick skim through the repo, and found the offending urls to the old apache git repos:
> {code:java}
> (py35) ➜ spark git:(update-apache-repo) grep -r git-wip *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git"
> pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/spark.git
> {code}
> this affects all versions of spark, so it will need to be backported to all released versions.
> i'll put together a pull request later today.
[jira] [Resolved] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment
[ https://issues.apache.org/jira/browse/SPARK-26545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-26545.
-----------------------------
       Resolution: Fixed
         Assignee: Kris Mok
    Fix Version/s: 3.0.0
                   2.4.1
                   2.3.3
                   2.2.3

> Fix typo in EqualNullSafe's truth table comment
> -----------------------------------------------
>
>                 Key: SPARK-26545
>                 URL: https://issues.apache.org/jira/browse/SPARK-26545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Kris Mok
>            Assignee: Kris Mok
>            Priority: Trivial
>             Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
> The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results as UNKNOWN.
[jira] [Assigned] (SPARK-26550) New datasource for benchmarking
[ https://issues.apache.org/jira/browse/SPARK-26550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26550:
------------------------------------

    Assignee: (was: Apache Spark)

> New datasource for benchmarking
> -------------------------------
>
>                 Key: SPARK-26550
>                 URL: https://issues.apache.org/jira/browse/SPARK-26550
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> The purpose of the new datasource is materialisation of a dataset without the additional overhead of actions and of converting rows' values to other types. This can be used in benchmarking, as well as in cases where a dataset needs to be materialised for its side effects, as in caching.
[jira] [Assigned] (SPARK-26550) New datasource for benchmarking
[ https://issues.apache.org/jira/browse/SPARK-26550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26550:
------------------------------------

    Assignee: Apache Spark

> New datasource for benchmarking
> -------------------------------
>
>                 Key: SPARK-26550
>                 URL: https://issues.apache.org/jira/browse/SPARK-26550
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Assignee: Apache Spark
>            Priority: Major
>
> The purpose of the new datasource is materialisation of a dataset without the additional overhead of actions and of converting rows' values to other types. This can be used in benchmarking, as well as in cases where a dataset needs to be materialised for its side effects, as in caching.
[jira] [Created] (SPARK-26550) New datasource for benchmarking
Maxim Gekk created SPARK-26550: -- Summary: New datasource for benchmarking Key: SPARK-26550 URL: https://issues.apache.org/jira/browse/SPARK-26550 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The purpose of the new datasource is to materialise a dataset without the additional overhead associated with actions and with converting row values to other types. It can be used in benchmarking, as well as in cases where a dataset needs to be materialised for side effects, as in caching. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
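The materialisation idea behind SPARK-26550 can be sketched in plain Python. The `materialize` helper below is hypothetical (it is not the proposed datasource API): the point is simply to consume every row without converting or collecting it, so a benchmark measures only the cost of producing the rows.

```python
import time

def materialize(rows):
    """Consume an iterator of rows without converting or collecting
    them; the only measured cost is producing the rows themselves."""
    count = 0
    for _ in rows:
        count += 1
    return count

# Stand-in for a dataset's row iterator.
start = time.perf_counter()
n = materialize(x * x for x in range(1_000_000))
elapsed = time.perf_counter() - start
print(f"materialised {n} rows in {elapsed:.3f}s")
```

Because nothing is stored or converted, the timing reflects the upstream computation rather than collection or serialization overhead, which matches the stated motivation of the ticket.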
[jira] [Assigned] (SPARK-26549) PySpark worker reuse takes no effect for Python3
[ https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26549: Assignee: Apache Spark > PySpark worker reuse takes no effect for Python3 > --- > > Key: SPARK-26549 > URL: https://issues.apache.org/jira/browse/SPARK-26549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > During [the follow-up > work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for > the PySpark worker reuse scenario, we found that worker reuse takes no effect > for Python3, while it works properly for Python2 and PyPy. > It happens because, when the Python worker checks the end of the stream under > Python3, it gets an unexpected value -1, which refers to > END_OF_DATA_SECTION. See the code in worker.py: > {code:python} > # check end of stream > if read_int(infile) == SpecialLengths.END_OF_STREAM: > write_int(SpecialLengths.END_OF_STREAM, outfile) > else: > # write a different value to tell JVM to not reuse this worker > write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) > sys.exit(-1) > {code} > The code works well for Python2 and PyPy because END_OF_DATA_SECTION has > already been handled while loading the iterator from the socket stream; see > the code in FramedSerializer: > {code:python} > def load_stream(self, stream): > while True: > try: > yield self._read_with_length(stream) > except EOFError: > return > ... 
> def _read_with_length(self, stream): > length = read_int(stream) > if length == SpecialLengths.END_OF_DATA_SECTION: > raise EOFError # END_OF_DATA_SECTION raises EOFError here, which is caught in > load_stream > elif length == SpecialLengths.NULL: > return None > obj = stream.read(length) > if len(obj) < length: > raise EOFError > return self.loads(obj) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26549) PySpark worker reuse takes no effect for Python3
[ https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26549: Assignee: (was: Apache Spark) > PySpark worker reuse takes no effect for Python3 > --- > > Key: SPARK-26549 > URL: https://issues.apache.org/jira/browse/SPARK-26549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > During [the follow-up > work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for > the PySpark worker reuse scenario, we found that worker reuse takes no effect > for Python3, while it works properly for Python2 and PyPy. > It happens because, when the Python worker checks the end of the stream under > Python3, it gets an unexpected value -1, which refers to > END_OF_DATA_SECTION. See the code in worker.py: > {code:python} > # check end of stream > if read_int(infile) == SpecialLengths.END_OF_STREAM: > write_int(SpecialLengths.END_OF_STREAM, outfile) > else: > # write a different value to tell JVM to not reuse this worker > write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) > sys.exit(-1) > {code} > The code works well for Python2 and PyPy because END_OF_DATA_SECTION has > already been handled while loading the iterator from the socket stream; see > the code in FramedSerializer: > {code:python} > def load_stream(self, stream): > while True: > try: > yield self._read_with_length(stream) > except EOFError: > return > ... 
> def _read_with_length(self, stream): > length = read_int(stream) > if length == SpecialLengths.END_OF_DATA_SECTION: > raise EOFError # END_OF_DATA_SECTION raises EOFError here, which is caught in > load_stream > elif length == SpecialLengths.NULL: > return None > obj = stream.read(length) > if len(obj) < length: > raise EOFError > return self.loads(obj) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
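The framing protocol the two quoted code blocks rely on can be sketched end to end. In this simplified sketch an in-memory buffer stands in for the worker's socket; END_OF_DATA_SECTION = -1 is stated in the report, while the END_OF_STREAM and NULL values are assumptions for illustration, not necessarily PySpark's actual constants.

```python
import io
import struct

# Marker values: -1 for END_OF_DATA_SECTION comes from the report;
# the other two are assumed here for illustration.
END_OF_DATA_SECTION = -1
END_OF_STREAM = -4
NULL = -5

def write_int(stream, value):
    stream.write(struct.pack("!i", value))

def read_int(stream):
    return struct.unpack("!i", stream.read(4))[0]

def load_stream(stream):
    # Mirrors FramedSerializer.load_stream: END_OF_DATA_SECTION ends
    # iteration and is consumed here, so the caller's next read_int()
    # sees the marker that follows it.
    while True:
        length = read_int(stream)
        if length == END_OF_DATA_SECTION:
            return
        if length == NULL:
            yield None
        else:
            yield stream.read(length)

buf = io.BytesIO()
write_int(buf, 5)
buf.write(b"hello")
write_int(buf, END_OF_DATA_SECTION)
write_int(buf, END_OF_STREAM)
buf.seek(0)

records = list(load_stream(buf))  # consumes up to END_OF_DATA_SECTION
marker = read_int(buf)            # worker.py-style end-of-stream check
assert marker == END_OF_STREAM
```

In this healthy path the deserializer consumes the -1 marker, so the final read_int sees END_OF_STREAM and the JVM would keep the worker. The reported bug is that under Python3 the -1 leaked through to worker.py's check instead, making the JVM discard the worker.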
[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab
[ https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734986#comment-16734986 ] Pablo Langa Blanco commented on SPARK-26457: Hi [~deshanxiao], What configurations are you thinking about? Could you explain cases where this information would be relevant? I'm thinking of the case where you are working with YARN: YARN already has all the information about Hadoop that we could need about the Spark job, so you don't need it duplicated in the History Server. Thanks! > Show hadoop configurations in HistoryServer environment tab > --- > > Key: SPARK-26457 > URL: https://issues.apache.org/jira/browse/SPARK-26457 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.3.2, 2.4.0 > Environment: Maybe it would be good to show some Hadoop configurations in the > HistoryServer environment tab for debugging Hadoop-related bugs >Reporter: deshanxiao >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25917) Spark UI's executors page loads forever when memoryMetrics is None. Fix is to JSON-ignore memoryMetrics when it is None.
[ https://issues.apache.org/jira/browse/SPARK-25917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734984#comment-16734984 ] Pablo Langa Blanco commented on SPARK-25917: The pull request was closed because the problem has already been solved, so this issue should be closed too. > Spark UI's executors page loads forever when memoryMetrics is None. Fix is to > JSON-ignore memoryMetrics when it is None. > > > Key: SPARK-25917 > URL: https://issues.apache.org/jira/browse/SPARK-25917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: Rong Tang >Priority: Major > > Spark UI's executors page loads forever when memoryMetrics is None. Fix is to > JSON-ignore memoryMetrics when it is None. > ## How was this patch tested? > Before fix: (loads forever) > ![image](https://user-images.githubusercontent.com/1785565/47875681-64dfe480-ddd4-11e8-8d15-5ed1457bc24f.png) > After fix: > ![image](https://user-images.githubusercontent.com/1785565/47875691-6b6e5c00-ddd4-11e8-9895-db8dd9730ee1.png) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
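The fix described above, omitting a None-valued field during JSON serialization, can be sketched in Python. The field names come from the report; the filtering helper itself is a hypothetical stand-in for Spark's actual Jackson-based annotation approach.

```python
import json

def to_json_skip_none(obj):
    """Serialize a dict, dropping None-valued fields -- the moral
    equivalent of JSON-ignoring memoryMetrics when it is None."""
    return json.dumps({k: v for k, v in obj.items() if v is not None},
                      sort_keys=True)

executor = {"id": "driver", "memoryMetrics": None, "maxMemory": 512}
payload = to_json_skip_none(executor)
assert "memoryMetrics" not in json.loads(payload)
```

With the None field omitted, a client that cannot parse a null metrics object simply sees the field as absent instead of choking on it.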
[jira] [Updated] (SPARK-26549) PySpark worker reuse takes no effect for Python3
[ https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-26549: Description: During [the follow-up work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for PySpark worker reuse scenario, we found that the worker reuse takes no effect for Python3 while works properly for Python2 and PyPy. It happened because, during the python worker check end of the stream in Python3, we got an unexpected value -1 here which refers to END_OF_DATA_SECTION. See the code in worker.py: {code:python} # check end of stream if read_int(infile) == SpecialLengths.END_OF_STREAM: write_int(SpecialLengths.END_OF_STREAM, outfile) else: # write a different value to tell JVM to not reuse this worker write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) sys.exit(-1) {code} The code works well for Python2 and PyPy cause the END_OF_DATA_SECTION has been handled during load iterator from the socket stream, see the code in FramedSerializer: {code:python} def load_stream(self, stream): while True: try: yield self._read_with_length(stream) except EOFError: return ... def _read_with_length(self, stream): length = read_int(stream) if length == SpecialLengths.END_OF_DATA_SECTION: raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in load_stream elif length == SpecialLengths.NULL: return None obj = stream.read(length) if len(obj) < length: raise EOFError return self.loads(obj) {code} was: During [the follow-up work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for PySpark worker reuse scenario, we found that the worker reuse takes no effect for Python3 while works properly for Python2 and PyPy. It happened because, during the python worker check end of the stream in Python3, we got an unexpected value -1 here which refers to END_OF_DATA_SECTION. 
See the code in worker.py: {code:python} # check end of stream if read_int(infile) == SpecialLengths.END_OF_STREAM: write_int(SpecialLengths.END_OF_STREAM, outfile) else: # write a different value to tell JVM to not reuse this worker write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) sys.exit(-1) {code} The code works well for Python2 and PyPy cause the END_OF_DATA_SECTION has been handled during load iterator from the socket stream, see the code in FramedSerializer: {code:python} def load_stream(self, stream): while True: try: yield self._read_with_length(stream) except EOFError: return ... def _read_with_length(self, stream): length = read_int(stream) if length == SpecialLengths.END_OF_DATA_SECTION: raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in load_stream elif length == SpecialLengths.NULL: return None obj = stream.read(length) if len(obj) < length: raise EOFError return self.loads(obj) {code} > PySpark worker reuse take no effect for Python3 > --- > > Key: SPARK-26549 > URL: https://issues.apache.org/jira/browse/SPARK-26549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > During [the follow-up > work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for > PySpark worker reuse scenario, we found that the worker reuse takes no effect > for Python3 while works properly for Python2 and PyPy. > It happened because, during the python worker check end of the stream in > Python3, we got an unexpected value -1 here which refers to > END_OF_DATA_SECTION. 
See the code in worker.py: > {code:python} > # check end of stream > if read_int(infile) == SpecialLengths.END_OF_STREAM: > write_int(SpecialLengths.END_OF_STREAM, outfile) > else: > # write a different value to tell JVM to not reuse this worker > write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) > sys.exit(-1) > {code} > The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has > been handled while loading the iterator from the socket stream, see the code in > FramedSerializer: > {code:python} > def load_stream(self, stream): > while True: > try: > yield self._read_with_length(stream) > except EOFError: > return > ... > def _read_with_length(self, stream): > length = read_int(stream) > if length == SpecialLengths.END_OF_DATA_SECTION: > raise EOFError # END_OF_DATA_SECTION raises EOFError here, which is caught in > load_stream > elif length == SpecialLengths.NULL: > return None > obj = stream.read(length) > if len(obj) < length: > raise EOFError > return self.loads(obj) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26549) PySpark worker reuse takes no effect for Python3
[ https://issues.apache.org/jira/browse/SPARK-26549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-26549: Description: During [the follow-up work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for PySpark worker reuse scenario, we found that the worker reuse takes no effect for Python3 while works properly for Python2 and PyPy. It happened because, during the python worker check end of the stream in Python3, we got an unexpected value -1 here which refers to END_OF_DATA_SECTION. See the code in worker.py: {code:python} # check end of stream if read_int(infile) == SpecialLengths.END_OF_STREAM: write_int(SpecialLengths.END_OF_STREAM, outfile) else: # write a different value to tell JVM to not reuse this worker write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) sys.exit(-1) {code} The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has been handled during load iterator from the socket stream, see the code in FramedSerializer: {code:python} def load_stream(self, stream): while True: try: yield self._read_with_length(stream) except EOFError: return ... def _read_with_length(self, stream): length = read_int(stream) if length == SpecialLengths.END_OF_DATA_SECTION: raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in load_stream elif length == SpecialLengths.NULL: return None obj = stream.read(length) if len(obj) < length: raise EOFError return self.loads(obj) {code} was: During [the follow-up work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for PySpark worker reuse scenario, we found that the worker reuse takes no effect for Python3 while works properly for Python2 and PyPy. It happened because, during the python worker check end of the stream in Python3, we got an unexpected value -1 here which refers to END_OF_DATA_SECTION. 
See the code in worker.py: {code:python} # check end of stream if read_int(infile) == SpecialLengths.END_OF_STREAM: write_int(SpecialLengths.END_OF_STREAM, outfile) else: # write a different value to tell JVM to not reuse this worker write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) sys.exit(-1) {code} The code works well for Python2 and PyPy cause the END_OF_DATA_SECTION has been handled during load iterator from the socket stream, see the code in FramedSerializer: {code:python} def load_stream(self, stream): while True: try: yield self._read_with_length(stream) except EOFError: return ... def _read_with_length(self, stream): length = read_int(stream) if length == SpecialLengths.END_OF_DATA_SECTION: raise EOFError #END_OF_DATA_SECTION raised EOF here and catched in load_stream elif length == SpecialLengths.NULL: return None obj = stream.read(length) if len(obj) < length: raise EOFError return self.loads(obj) {code} > PySpark worker reuse take no effect for Python3 > --- > > Key: SPARK-26549 > URL: https://issues.apache.org/jira/browse/SPARK-26549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > During [the follow-up > work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for > PySpark worker reuse scenario, we found that the worker reuse takes no effect > for Python3 while works properly for Python2 and PyPy. > It happened because, during the python worker check end of the stream in > Python3, we got an unexpected value -1 here which refers to > END_OF_DATA_SECTION. 
See the code in worker.py: > {code:python} > # check end of stream > if read_int(infile) == SpecialLengths.END_OF_STREAM: > write_int(SpecialLengths.END_OF_STREAM, outfile) > else: > # write a different value to tell JVM to not reuse this worker > write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) > sys.exit(-1) > {code} > The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has > been handled while loading the iterator from the socket stream, see the code in > FramedSerializer: > {code:python} > def load_stream(self, stream): > while True: > try: > yield self._read_with_length(stream) > except EOFError: > return > ... > def _read_with_length(self, stream): > length = read_int(stream) > if length == SpecialLengths.END_OF_DATA_SECTION: > raise EOFError # END_OF_DATA_SECTION raises EOFError here, which is caught in > load_stream > elif length == SpecialLengths.NULL: > return None > obj = stream.read(length) > if len(obj) < length: > raise EOFError > return self.loads(obj) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26549) PySpark worker reuse takes no effect for Python3
Yuanjian Li created SPARK-26549: --- Summary: PySpark worker reuse takes no effect for Python3 Key: SPARK-26549 URL: https://issues.apache.org/jira/browse/SPARK-26549 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Reporter: Yuanjian Li During [the follow-up work|https://github.com/apache/spark/pull/23435#issuecomment-451079886] for the PySpark worker reuse scenario, we found that worker reuse takes no effect for Python3, while it works properly for Python2 and PyPy. It happens because, when the Python worker checks the end of the stream under Python3, it gets an unexpected value -1, which refers to END_OF_DATA_SECTION. See the code in worker.py: {code:python} # check end of stream if read_int(infile) == SpecialLengths.END_OF_STREAM: write_int(SpecialLengths.END_OF_STREAM, outfile) else: # write a different value to tell JVM to not reuse this worker write_int(SpecialLengths.END_OF_DATA_SECTION, outfile) sys.exit(-1) {code} The code works well for Python2 and PyPy because the END_OF_DATA_SECTION has been handled while loading the iterator from the socket stream, see the code in FramedSerializer: {code:python} def load_stream(self, stream): while True: try: yield self._read_with_length(stream) except EOFError: return ... def _read_with_length(self, stream): length = read_int(stream) if length == SpecialLengths.END_OF_DATA_SECTION: raise EOFError # END_OF_DATA_SECTION raises EOFError here, which is caught in load_stream elif length == SpecialLengths.NULL: return None obj = stream.read(length) if len(obj) < length: raise EOFError return self.loads(obj) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26548) Don't block during query optimization
[ https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26548: Assignee: (was: Apache Spark) > Don't block during query optimization > - > > Key: SPARK-26548 > URL: https://issues.apache.org/jira/browse/SPARK-26548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dave DeCaprio >Priority: Minor > Labels: sql > > In Spark 2.4.0 the CacheManager was updated so that it will not execute jobs > while it holds a lock. This was introduced in SPARK-23880. > The CacheManager still holds a write lock during the execution of the query > optimizer. For complex queries the optimizer can run for a long time (we see > 10-15 minutes for some exceptionally large queries), so holding the lock > allows only one query to be optimized at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26548) Don't block during query optimization
[ https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26548: Assignee: Apache Spark > Don't block during query optimization > - > > Key: SPARK-26548 > URL: https://issues.apache.org/jira/browse/SPARK-26548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dave DeCaprio >Assignee: Apache Spark >Priority: Minor > Labels: sql > > In Spark 2.4.0 the CacheManager was updated so that it will not execute jobs > while it holds a lock. This was introduced in SPARK-23880. > The CacheManager still holds a write lock during the execution of the query > optimizer. For complex queries the optimizer can run for a long time (we see > 10-15 minutes for some exceptionally large queries), so holding the lock > allows only one query to be optimized at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
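The locking problem described in SPARK-26548, and one possible shape of a fix, can be sketched with Python threads. This is a toy plan cache, not the CacheManager's actual code; the point is that running the expensive optimization outside the critical section lets callers optimize concurrently.

```python
import threading

class PlanCache:
    """Toy query-plan cache. Holding the lock across optimize()
    serializes every caller; the concurrent variant keeps critical
    sections short and optimizes with no lock held."""
    def __init__(self):
        self._lock = threading.Lock()
        self._plans = {}

    def get_blocking(self, query, optimize):
        with self._lock:                 # lock held during optimize()
            if query not in self._plans:
                self._plans[query] = optimize(query)
            return self._plans[query]

    def get_concurrent(self, query, optimize):
        with self._lock:                 # fast path: cache hit
            if query in self._plans:
                return self._plans[query]
        plan = optimize(query)           # expensive work, no lock held
        with self._lock:                 # first writer wins on a race
            return self._plans.setdefault(query, plan)

cache = PlanCache()
plan = cache.get_concurrent("select 1", lambda q: q.upper())
assert plan == "SELECT 1"
```

The trade-off is that two racing threads may both optimize the same query once, with one result discarded; for 10-15 minute optimizations on distinct queries, that duplicated work is usually far cheaper than forcing every query through a single lock.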
[jira] [Assigned] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL
[ https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26535: Assignee: (was: Apache Spark) > Parsing literals as DOUBLE instead of DECIMAL > - > > Key: SPARK-26535 > URL: https://issues.apache.org/jira/browse/SPARK-26535 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Priority: Major > > As pointed out in [~dkbiswal]'s comment > https://github.com/apache/spark/pull/22450#issuecomment-423082389, most > other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default. > Spark currently treats them as DECIMAL. This is quite problematic, > especially for operations on decimals, for which we base our > implementation on Hive/MSSQL. > So this ticket proposes resolving literals as DOUBLE by default, > with a config that allows reverting to the previous behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL
[ https://issues.apache.org/jira/browse/SPARK-26535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26535: Assignee: Apache Spark > Parsing literals as DOUBLE instead of DECIMAL > - > > Key: SPARK-26535 > URL: https://issues.apache.org/jira/browse/SPARK-26535 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Major > > As pointed out in [~dkbiswal]'s comment > https://github.com/apache/spark/pull/22450#issuecomment-423082389, most > other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default. > Spark currently treats them as DECIMAL. This is quite problematic, > especially for operations on decimals, for which we base our > implementation on Hive/MSSQL. > So this ticket proposes resolving literals as DOUBLE by default, > with a config that allows reverting to the previous behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
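The semantic gap the ticket describes can be sketched with Python's stand-ins for the two types: `Decimal` for SQL DECIMAL and `float` for an IEEE-754 DOUBLE. The literal values are illustrative, not taken from the ticket.

```python
from decimal import Decimal

# DECIMAL semantics: exact, digit-for-digit arithmetic.
as_decimal = Decimal("0.1") + Decimal("0.2")
assert as_decimal == Decimal("0.3")

# DOUBLE semantics: binary floating point, so the same literals
# accumulate representation error.
as_double = 0.1 + 0.2
assert as_double != 0.3
assert abs(as_double - 0.3) < 1e-15

# Which default the parser picks changes the result type and the
# rounding behavior of every expression the literal participates in,
# which is why a compatibility config is proposed alongside the change.
```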
[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934 ] Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 4:55 PM: --- Hi, [~zsxwing] and [~tgraves]. While looking at other failures, I noticed that this failure still happens frequently. The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might be related. - [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05) - [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05) - [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100787|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100787/consoleFull] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100788|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100788/consoleFull] (amp-jenkins-worker-05) was (Author: dongjoon): Hi, [~zsxwing] and [~tgraves]. While looking at other failures, I noticed that this failure still happens frequently. The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might be related. 
- [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05) - [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05) - [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05) > Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot > 2018-11-01 at 10.17.16 AM.png > > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934 ] Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 4:53 PM: --- Hi, [~zsxwing] and [~tgraves]. While looking at other failures, I noticed that this failure still happens frequently. The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might be related. - [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05) - [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05) - [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05) was (Author: dongjoon): Hi, [~zsxwing] and [~tgraves]. While looking at other failures, I noticed that this failure still happens frequently in Maven testing. The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might be related. 
- [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05) - [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05) - [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05) > Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot > 2018-11-01 at 10.17.16 AM.png > > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934 ] Dongjoon Hyun edited comment on SPARK-25692 at 1/5/19 4:52 PM: --- Hi, [~zsxwing] and [~tgraves]. While looking at other failures, I noticed that this failure still happens frequently in Maven testing. The failure is always `fetchBothChunks`. The `amp-jenkins-worker-05` machine might be related. - [master 5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] (amp-jenkins-worker-05) - [master 5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] (amp-jenkins-worker-05) - [master 5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100784|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100784/consoleFull] (amp-jenkins-worker-05) - [SparkPullRequestBuilder 100785|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100785/consoleFull] (amp-jenkins-worker-05) was (Author: dongjoon): Hi, [~zsxwing] and [~tgraves]. While looking at other failures, I noticed that this failure still happens frequently in Maven testing. The failure is always `fetchBothChunks`. Can we increase the timeout from 5 seconds to 10 (or 20) seconds? Does it hide the underlying real issue? 
- [5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] - [5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] - [5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] > Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot > 2018-11-01 at 10.17.16 AM.png > > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25692: Assignee: Apache Spark > Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Blocker > Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot > 2018-11-01 at 10.17.16 AM.png > > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25692: Assignee: (was: Apache Spark) > Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot > 2018-11-01 at 10.17.16 AM.png > > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26547) Remove duplicate toHiveString from HiveUtils
Maxim Gekk created SPARK-26547: -- Summary: Remove duplicate toHiveString from HiveUtils Key: SPARK-26547 URL: https://issues.apache.org/jira/browse/SPARK-26547 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The toHiveString method is already implemented in the HiveResult object. The method can be removed from HiveUtils. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26548) Don't block during query optimization
[ https://issues.apache.org/jira/browse/SPARK-26548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734935#comment-16734935 ] Dave DeCaprio commented on SPARK-26548: --- I have a fix and am creating a PR for this. > Don't block during query optimization > - > > Key: SPARK-26548 > URL: https://issues.apache.org/jira/browse/SPARK-26548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dave DeCaprio >Priority: Minor > Labels: sql > > In Spark 2.4.0 the CacheManager was updated so it will not execute jobs while > it holds a lock. This was introduced in SPARK-23880. > The CacheManager still holds a write lock during the execution of the query > optimizer. For complex queries the optimizer can run for a long time (we see > 10-15 minutes for some exceptionally large queries). This allows only one > thread to optimize at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
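The fix direction implied above (don't run expensive work while holding the lock) can be illustrated with a toy Python sketch. Note this is a hypothetical stand-in, not Spark's actual CacheManager code: the lock is held only for the short cache lookup and publish steps, while the long-running "optimization" runs unlocked.

```python
import threading

class ToyCacheManager:
    """Toy model: hold the lock only around shared-state access,
    never around the expensive optimization step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def _optimize(self, query):
        # Stand-in for a long-running query-optimizer pass.
        return query.upper()

    def cache_query(self, query):
        with self._lock:                 # short critical section: lookup
            if query in self._cache:
                return self._cache[query]
        plan = self._optimize(query)     # expensive work, no lock held
        with self._lock:                 # short critical section: publish
            # setdefault keeps the first plan if two threads raced here
            return self._cache.setdefault(query, plan)

mgr = ToyCacheManager()
print(mgr.cache_query("select 1"))  # SELECT 1
```

With this shape, two threads optimizing different queries no longer serialize on one lock; the trade-off is that two threads may redundantly optimize the same query once, with `setdefault` resolving the race.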
[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25692: -- Summary: Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks (was: Flaky test: ChunkFetchIntegrationSuite) > Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot > 2018-11-01 at 10.17.16 AM.png > > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734934#comment-16734934 ] Dongjoon Hyun commented on SPARK-25692: --- Hi, [~zsxwing] and [~tgraves]. While looking at other failures, I noticed that this failure still happens frequently in Maven testing. The failure is always `fetchBothChunks`. Can we increase the timeout from 5 seconds to 10 (or 20) seconds? Or would that hide the real underlying issue? - [5828|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5828/testReport] - [5822|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5822/testReport] - [5814|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5814/testReport] > Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot > 2018-11-01 at 10.17.16 AM.png > > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in the 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
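On the timeout question: raising a test timeout is cheap when the wait is a poll rather than a fixed sleep, because a passing run returns as soon as the condition holds and only a failing run pays the full timeout. A minimal sketch of that pattern in generic Python (the actual suite is Java/JUnit; this is just the idea):

```python
import time

def wait_until(condition, timeout=10.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds elapse.
    Raising the timeout (e.g. 5s -> 10s) only slows the failing case;
    a passing test still finishes as soon as the condition is true."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```

That said, as the comment notes, a longer timeout can mask a real slowdown on specific workers, so the worker correlation above is worth checking first.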
[jira] [Created] (SPARK-26548) Don't block during query optimization
Dave DeCaprio created SPARK-26548: - Summary: Don't block during query optimization Key: SPARK-26548 URL: https://issues.apache.org/jira/browse/SPARK-26548 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Dave DeCaprio In Spark 2.4.0 the CacheManager was updated so it will not execute jobs while it holds a lock. This was introduced in SPARK-23880. The CacheManager still holds a write lock during the execution of the query optimizer. For complex queries the optimizer can run for a long time (we see 10-15 minutes for some exceptionally large queries). This allows only one thread to optimize at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26547) Remove duplicate toHiveString from HiveUtils
[ https://issues.apache.org/jira/browse/SPARK-26547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26547: Assignee: Apache Spark > Remove duplicate toHiveString from HiveUtils > > > Key: SPARK-26547 > URL: https://issues.apache.org/jira/browse/SPARK-26547 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The toHiveString method is already implemented in the HiveResult object. The > method can be removed from HiveUtils. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26547) Remove duplicate toHiveString from HiveUtils
[ https://issues.apache.org/jira/browse/SPARK-26547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26547: Assignee: (was: Apache Spark) > Remove duplicate toHiveString from HiveUtils > > > Key: SPARK-26547 > URL: https://issues.apache.org/jira/browse/SPARK-26547 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The toHiveString method is already implemented in the HiveResult object. The > method can be removed from HiveUtils. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734932#comment-16734932 ] Dongjoon Hyun commented on SPARK-26540: --- [~mgaido] requested to close this because SPARK-26538 is created first and has PR before this. Please see the PR. I closed mine. > Support PostgreSQL numeric arrays without precision/scale > - > > Key: SPARK-26540 > URL: https://issues.apache.org/jira/browse/SPARK-26540 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This bug was reported in spark-user: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html > To reproduce this; > {code} > // Creates a table in a PostgreSQL shell > postgres=# CREATE TABLE t (v numeric[], d numeric); > CREATE TABLE > postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555); > INSERT 0 1 > postgres=# SELECT * FROM t; > v |d > -+-- > {.222,.332} | 222.4555 > (1 row) > postgres=# \d t > Table "public.t" > Column | Type| Modifiers > +---+--- > v | numeric[] | > d | numeric | > // Then, reads it in Spark > ./bin/spark-shell --jars=postgresql-42.2.4.jar -v > scala> import java.util.Properties > scala> val options = new Properties(); > scala> options.setProperty("driver", "org.postgresql.Driver") > scala> options.setProperty("user", "maropu") > scala> options.setProperty("password", "") > scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options) > scala> pgTable.printSchema > root > |-- v: array (nullable = true) > ||-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true) > scala> pgTable.show > 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:281) > at 
org.apache.spark.sql.types.Decimal.set(Decimal.scala:116) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) > ... > {code} > I looked over the related code, and I think we need more logic to handle > numeric arrays; > https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
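One plausible fix direction, sketched in plain Python (the helper below is hypothetical, not the actual PostgresDialect code): when the JDBC metadata reports precision 0 for an unconstrained `numeric` array element, fall back to a wide default decimal instead of the unusable `decimal(0,0)`. The (38, 18) default matches what the repro above already shows Spark choosing for the scalar `numeric` column `d`.

```python
# Assumed default, mirroring the decimal(38,18) seen in the reported schema.
DEFAULT_PRECISION, DEFAULT_SCALE = 38, 18

def numeric_element_type(precision, scale):
    """Map JDBC-reported (precision, scale) for a numeric[] element to a
    usable decimal type: PostgreSQL reports 0 when no (p,s) was declared."""
    if precision == 0:
        return (DEFAULT_PRECISION, DEFAULT_SCALE)
    return (precision, scale)

print(numeric_element_type(0, 0))   # (38, 18)  -- unconstrained numeric
print(numeric_element_type(10, 2))  # (10, 2)   -- declared numeric(10,2)
```

With such a fallback, the value 222.4555 (precision 4) fits comfortably instead of tripping the `Decimal precision 4 exceeds max precision 0` requirement.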
[jira] [Resolved] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26540. --- Resolution: Duplicate > Support PostgreSQL numeric arrays without precision/scale > - > > Key: SPARK-26540 > URL: https://issues.apache.org/jira/browse/SPARK-26540 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This bug was reported in spark-user: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html > To reproduce this; > {code} > // Creates a table in a PostgreSQL shell > postgres=# CREATE TABLE t (v numeric[], d numeric); > CREATE TABLE > postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555); > INSERT 0 1 > postgres=# SELECT * FROM t; > v |d > -+-- > {.222,.332} | 222.4555 > (1 row) > postgres=# \d t > Table "public.t" > Column | Type| Modifiers > +---+--- > v | numeric[] | > d | numeric | > // Then, reads it in Spark > ./bin/spark-shell --jars=postgresql-42.2.4.jar -v > scala> import java.util.Properties > scala> val options = new Properties(); > scala> options.setProperty("driver", "org.postgresql.Driver") > scala> options.setProperty("user", "maropu") > scala> options.setProperty("password", "") > scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options) > scala> pgTable.printSchema > root > |-- v: array (nullable = true) > ||-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true) > scala> pgTable.show > 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) > ... 
> {code} > I looked over the related code, and I think we need more logic to handle > numeric arrays; > https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26280) Spark will read entire CSV file even when limit is used
[ https://issues.apache.org/jira/browse/SPARK-26280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26280. -- Resolution: Duplicate > Spark will read entire CSV file even when limit is used > --- > > Key: SPARK-26280 > URL: https://issues.apache.org/jira/browse/SPARK-26280 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Amir Bar-Or >Priority: Major > > When you read CSV as below, the parser still wastes time and reads the entire > file: > var lineDF1 = spark.read > .format("com.databricks.spark.csv") > .option("header", "true") //reading the headers > .option("mode", "DROPMALFORMED") > .option("delimiter",",") > .option("inferSchema", "false") > .schema(line_schema) > .load(i_lineitem) > lineDF1.limit(10) > > Even though a LocalLimit is created, this does not stop the FileScan and > the parser from parsing the entire file. Is it possible to push the limit down > and stop the parsing? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
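For reference, what a pushed-down limit would mean mechanically: stop consuming input once enough rows have been produced, rather than scanning the whole file and discarding rows afterwards. A toy illustration in plain Python (not Spark's CSV datasource):

```python
import csv
import io

def read_limited(lines, n):
    """Toy pushed-down limit: stop parsing after n rows, so the rest
    of the input is never even read."""
    reader = csv.reader(lines)
    out = []
    for row in reader:
        out.append(row)
        if len(out) == n:
            break  # remaining lines are never parsed
    return out

data = io.StringIO("a,1\nb,2\nc,3\n")
print(read_limited(data, 2))  # [['a', '1'], ['b', '2']]
```

In Spark terms, this corresponds to the `LocalLimit` actually terminating the `FileScan` iterator early, which is what the reporter is asking about.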
[jira] [Resolved] (SPARK-26336) left_anti join with Na Values
[ https://issues.apache.org/jira/browse/SPARK-26336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26336. -- Resolution: Invalid > left_anti join with Na Values > - > > Key: SPARK-26336 > URL: https://issues.apache.org/jira/browse/SPARK-26336 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Carlos >Priority: Major > > When I'm joining two dataframes whose data have NA values, the left_anti > join does not work correctly, because it does not detect records with NA values. > Example: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.functions import * > spark = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate() > data = [(1,"Test"),(2,"Test"),(3,None)] > df1 = spark.createDataFrame(data,("id","columndata")) > df2 = spark.createDataFrame(data,("id","columndata")) > df_joined = df1.join(df2, df1.columns,'left_anti'){code} > df_joined contains rows even though the two dataframes are identical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
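The Invalid resolution reflects that this is standard SQL null semantics rather than a Spark bug: `NULL = NULL` evaluates to unknown, so the `(3, None)` row never matches itself and survives the left_anti join. In PySpark the usual workaround is a null-safe join condition built with `Column.eqNullSafe` (the `<=>` operator). The plain-Python model below reproduces both behaviours:

```python
def sql_eq(a, b):
    """SQL equality: comparing NULL with anything is unknown (no match)."""
    if a is None or b is None:
        return False
    return a == b

def null_safe_eq(a, b):
    """Spark's eqNullSafe / <=>: NULL <=> NULL is true."""
    if a is None and b is None:
        return True
    return sql_eq(a, b)

def left_anti(left, right, eq):
    """Keep left rows that match no right row under equality `eq`."""
    return [l for l in left
            if not any(all(eq(a, b) for a, b in zip(l, r)) for r in right)]

rows = [(1, "Test"), (2, "Test"), (3, None)]

print(left_anti(rows, rows, sql_eq))        # [(3, None)] -- the surprising row
print(left_anti(rows, rows, null_safe_eq))  # []          -- with <=> semantics
```

So joining the two identical dataframes with `df1["columndata"].eqNullSafe(df2["columndata"])` (plus the id equality) would make the anti-join come back empty, as the reporter expected.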
[jira] [Closed] (SPARK-26542) Support the coordinator to demerminte post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-26542. - > Support the coordinator to demerminte post-shuffle partitions more reasonably > - > > Key: SPARK-26542 > URL: https://issues.apache.org/jira/browse/SPARK-26542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0 > > > For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the > ExchangeCoordinator is introduced to determine the number of post-shuffle > partitions. But under certain conditions the coordinator does not perform very > well: there are always some tasks retained, and they work with Shuffle Read > Size / Records of 0.0B/0. We could increase > spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this, but that > action is unreasonable, as targetPostShuffleInputSize should not be set too > large. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
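The coalescing the coordinator performs can be sketched as a simple packing algorithm. The function below is an illustrative model, not Spark's ExchangeCoordinator code: consecutive post-shuffle partitions are packed until a target byte size, so empty (0B) partitions fold into their neighbours instead of becoming tasks that read nothing.

```python
def coalesce_partitions(sizes, target):
    """Pack consecutive partition sizes (bytes) into groups of at most
    `target` bytes each; a group is cut only when adding the next
    partition would exceed the target."""
    groups, current, current_size = [], [], 0
    for i, size in enumerate(sizes):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Empty partitions (0B) merge into neighbours rather than running alone.
print(coalesce_partitions([64, 0, 0, 30, 0, 70], 100))  # [[0, 1, 2, 3, 4], [5]]
```

This is why bumping `targetPostShuffleInputSize` "works": a larger target lets more (possibly empty) partitions fold together, but as the description notes, setting it very large trades away parallelism, so filtering empties directly is the cleaner fix.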
[jira] [Resolved] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26155. -- Resolution: Duplicate > Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS > in 3TB scale > -- > > Key: SPARK-26155 > URL: https://issues.apache.org/jira/browse/SPARK-26155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis > in Spark2.3 without L486&487.pdf, q19.sql, tpcds.result.xlsx > > > In our test environment, we found a serious performance degradation issue in > Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious > performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark > 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated > this problem and figured out the root cause is in community patch SPARK-21052 > which add metrics to hash join process. And the impact code is > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > . Q19 costs about 30 seconds without these two lines code and 126 seconds > with these code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26543) Support the coordinator to demerminte post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26543: - Target Version/s: (was: 2.3.0) > Support the coordinator to demerminte post-shuffle partitions more reasonably > - > > Key: SPARK-26543 > URL: https://issues.apache.org/jira/browse/SPARK-26543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0 > > Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png > > > For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the > ExchangeCoordinator will introduced to determine the number of post-shuffle > partitions. But in some certain conditions,the coordinator performed not very > well, there are always some tasks retained and they worked with Shuffle Read > Size / Records 0.0B/0 ,We could increase the > spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this > action is unreasonable as targetPostShuffleInputSize Should not be set too > large. As follow: > !image-2019-01-05-13-18-30-487.png! > We can filter the useless partition(0B) with ExchangeCoorditinator > automatically -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26543) Support the coordinator to demerminte post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734916#comment-16734916 ] Hyukjin Kwon commented on SPARK-26543: -- Also, Spark doesn't use patches; it uses PRs. Please take a look at https://spark.apache.org/contributing.html > Support the coordinator to demerminte post-shuffle partitions more reasonably > - > > Key: SPARK-26543 > URL: https://issues.apache.org/jira/browse/SPARK-26543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png > > > For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the > ExchangeCoordinator is introduced to determine the number of post-shuffle > partitions. But under certain conditions the coordinator does not perform very > well: there are always some tasks retained, and they work with Shuffle Read > Size / Records of 0.0B/0. We could increase > spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this, but that > action is unreasonable, as targetPostShuffleInputSize should not be set too > large. As follows: > !image-2019-01-05-13-18-30-487.png! > We can filter the useless partitions (0B) with the ExchangeCoordinator > automatically -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26543) Support the coordinator to demerminte post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26543: - Fix Version/s: (was: 2.3.0) > Support the coordinator to demerminte post-shuffle partitions more reasonably > - > > Key: SPARK-26543 > URL: https://issues.apache.org/jira/browse/SPARK-26543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png > > > For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the > ExchangeCoordinator will introduced to determine the number of post-shuffle > partitions. But in some certain conditions,the coordinator performed not very > well, there are always some tasks retained and they worked with Shuffle Read > Size / Records 0.0B/0 ,We could increase the > spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this > action is unreasonable as targetPostShuffleInputSize Should not be set too > large. As follow: > !image-2019-01-05-13-18-30-487.png! > We can filter the useless partition(0B) with ExchangeCoorditinator > automatically -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26543) Support the coordinator to demerminte post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734914#comment-16734914 ] Hyukjin Kwon commented on SPARK-26543: -- Please avoid setting the target version, which is usually reserved for committers, and the fix version, which is usually set when the issue is actually fixed. > Support the coordinator to demerminte post-shuffle partitions more reasonably > - > > Key: SPARK-26543 > URL: https://issues.apache.org/jira/browse/SPARK-26543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png > > > For Spark SQL, when we enable AE with 'set spark.sql.adaptive.enabled=true', the > ExchangeCoordinator is introduced to determine the number of post-shuffle > partitions. But under certain conditions the coordinator does not perform very > well: there are always some tasks retained, and they work with Shuffle Read > Size / Records of 0.0B/0. We could increase > spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this, but that > action is unreasonable, as targetPostShuffleInputSize should not be set too > large. As follows: > !image-2019-01-05-13-18-30-487.png! > We can filter the useless partitions (0B) with the ExchangeCoordinator > automatically -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26542) Support the coordinator to demerminte post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26542. -- Resolution: Duplicate > Support the coordinator to demerminte post-shuffle partitions more reasonably > - > > Key: SPARK-26542 > URL: https://issues.apache.org/jira/browse/SPARK-26542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0 > > > For SparkSQL ,when we open AE by 'set spark.sql.adapative.enable=true',the > ExchangeCoordinator will introduced to determine the number of post-shuffle > partitions. But in some certain conditions,the coordinator performed not very > well, there are always some tasks retained and they worked with Shuffle Read > Size / Records 0.0B/0 ,We could increase the > spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this,but this > action is unreasonable as targetPostShuffleInputSize Should not be set too > large. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26383) NPE when use DataFrameReader.jdbc with wrong URL
[ https://issues.apache.org/jira/browse/SPARK-26383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26383: Assignee: Apache Spark > NPE when use DataFrameReader.jdbc with wrong URL > > > Key: SPARK-26383 > URL: https://issues.apache.org/jira/browse/SPARK-26383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: clouds >Assignee: Apache Spark >Priority: Minor > > When passing wrong url to jdbc: > {code:java} > val opts = Map( > "url" -> "jdbc:mysql://localhost/db", > "dbtable" -> "table", > "driver" -> "org.postgresql.Driver" > ) > var df = spark.read.format("jdbc").options(opts).load > {code} > It would throw an NPE instead of complaining about connection failed. (Note > url and driver not matched here) > {code:java} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > {code} > as [postgresql jdbc driver > document|https://jdbc.postgresql.org/development/privateapi/org/postgresql/Driver.html#connect-java.lang.String-java.util.Properties-] > saying, The driver should return "null" if it realizes it is the wrong kind > of driver to connect to the given URL. 
> while > [ConnectionFactory|https://github.com/apache/spark/blob/e743e848484bf7d97e1b4f33ea83f8520ae7da04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L56] > does not check whether conn is null: > {code:java} > val conn: Connection = JdbcUtils.createConnectionFactory(options)() > {code} > and tries to close the conn anyway: > {code:java} > try { > ... > } finally { > conn.close() > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
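The defensive pattern the report points toward (fail fast with a clear error when the driver returns null, instead of letting the `finally` block NPE) can be modelled in a few lines of Python. The class and method names here are hypothetical stand-ins for the JDBC API:

```python
class WrongDriver:
    """Stand-in for a JDBC driver asked to handle a URL it doesn't own:
    per the JDBC contract, connect() returns null in that case."""
    def connect(self, url):
        return None

def resolve_table(driver, url):
    conn = driver.connect(url)
    if conn is None:
        # Clear, actionable error instead of a later NullPointerException.
        raise ValueError(f"No suitable driver for {url}")
    try:
        return conn.query("SELECT 1")
    finally:
        conn.close()  # only reached when conn is not None

try:
    resolve_table(WrongDriver(), "jdbc:mysql://localhost/db")
except ValueError as e:
    print(e)  # No suitable driver for jdbc:mysql://localhost/db
```

The key point is that both the usage and the cleanup path are guarded: the null check happens before entering the `try`/`finally`, so `close()` can never be called on a null connection.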
[jira] [Assigned] (SPARK-26383) NPE when use DataFrameReader.jdbc with wrong URL
[ https://issues.apache.org/jira/browse/SPARK-26383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26383: Assignee: (was: Apache Spark) > NPE when use DataFrameReader.jdbc with wrong URL > > > Key: SPARK-26383 > URL: https://issues.apache.org/jira/browse/SPARK-26383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: clouds >Priority: Minor > > When passing wrong url to jdbc: > {code:java} > val opts = Map( > "url" -> "jdbc:mysql://localhost/db", > "dbtable" -> "table", > "driver" -> "org.postgresql.Driver" > ) > var df = spark.read.format("jdbc").options(opts).load > {code} > It would throw an NPE instead of complaining about connection failed. (Note > url and driver not matched here) > {code:java} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > {code} > as [postgresql jdbc driver > document|https://jdbc.postgresql.org/development/privateapi/org/postgresql/Driver.html#connect-java.lang.String-java.util.Properties-] > saying, The driver should return "null" if it realizes it is the wrong kind > of driver to connect to the given URL. 
> while > [ConnectionFactory|https://github.com/apache/spark/blob/e743e848484bf7d97e1b4f33ea83f8520ae7da04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L56] > does not check whether conn is null: > {code:java} > val conn: Connection = JdbcUtils.createConnectionFactory(options)() > {code} > and tries to close the conn anyway: > {code:java} > try { > ... > } finally { > conn.close() > } > {code}
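The guard implied by the report can be sketched as follows. This is a hypothetical illustration (not Spark's actual JDBCRDD code, and the names are invented): a JDBC driver is allowed to return null for a URL it does not recognize, so the factory's result should be checked before use, and only a connection that was actually obtained should be closed.

```java
import java.sql.Connection;
import java.util.function.Function;
import java.util.function.Supplier;

public class NullSafeConnection {
    // Fail fast with a clear message when the factory yields null, instead of
    // letting conn.close() in the finally block throw a NullPointerException.
    public static <A> A withConnection(Supplier<Connection> factory, Function<Connection, A> body) {
        Connection conn = factory.get();
        if (conn == null) {
            throw new IllegalArgumentException(
                "The JDBC driver returned no connection; check that the url and driver options match");
        }
        try {
            return body.apply(conn);
        } finally {
            try {
                conn.close();
            } catch (java.sql.SQLException e) {
                // ignore close failures; the body's result or exception takes precedence
            }
        }
    }
}
```

With such a guard, a mismatched url/driver pair would surface as a descriptive IllegalArgumentException rather than an NPE from resolveTable.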
[jira] [Updated] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION
[ https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26078: -- Fix Version/s: 2.3.3 > WHERE .. IN fails to filter rows when used in combination with UNION > > > Key: SPARK-26078 > URL: https://issues.apache.org/jira/browse/SPARK-26078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 >Reporter: Arttu Voutilainen >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > Hey, > We encountered a case where Spark SQL does not seem to handle WHERE .. IN correctly when used in combination with UNION, and instead also returns rows that do not fulfill the condition. Swapping the order of the datasets in the UNION makes the problem go away. Repro below: > > {code} > sql = SQLContext(sc) > a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > a.registerTempTable('a') > b.registerTempTable('b') > bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'a' as source FROM a > UNION ALL > SELECT id, num, 'b' as source FROM b > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > no_bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'b' as source FROM b > UNION ALL > SELECT id, num, 'a' as source FROM a > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > bug.show() > no_bug.show() > bug.explain(True) > no_bug.explain(True) > {code} > This results in one extra row in the "bug" DF, coming from DF "b", which should not be there, as it does not fulfill the WHERE .. IN condition: > {code:java} > >>> bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| a| > | a| 2| b| > | b| 1| b| > +---+---+--+ > >>> no_bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| b| > | a| 2| a| > +---+---+--+ > {code} > The reason can be seen in the query plans: > {code:java} > >>>
bug.explain(True) > ... > == Optimized Logical Plan == > Union > :- Project [id#0, num#1L, a AS source#136] > : +- Join LeftSemi, (id#0 = id#4) > : :- LogicalRDD [id#0, num#1L], false > : +- Project [id#4] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Join LeftSemi, (id#4#172 = id#4#172) >:- Project [id#4, num#5L, b AS source#137] >: +- LogicalRDD [id#4, num#5L], false >+- Project [id#4 AS id#4#172] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition > seems wrong, and I believe it causes the LeftSemi to return true for all rows > in the left-hand-side table, thus failing to filter as the WHERE .. IN > should. Compare with the non-buggy version, where both LeftSemi joins have > distinct expression IDs on both sides: > {code:java} > >>> no_bug.explain() > ... > == Optimized Logical Plan == > Union > :- Project [id#4, num#5L, b AS source#142] > : +- Join LeftSemi, (id#4 = id#4#173) > : :- LogicalRDD [id#4, num#5L], false > : +- Project [id#4 AS id#4#173] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Project [id#0, num#1L, a AS source#143] >+- Join LeftSemi, (id#0 = id#4#173) > :- LogicalRDD [id#0, num#1L], false > +- Project [id#4 AS id#4#173] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > > Best, > -Arttu >
[jira] [Commented] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734887#comment-16734887 ] Takeshi Yamamuro commented on SPARK-26540: -- We need to close SPARK-26538 as duplicated when resolving this. > Support PostgreSQL numeric arrays without precision/scale > - > > Key: SPARK-26540 > URL: https://issues.apache.org/jira/browse/SPARK-26540 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This bug was reported in spark-user: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html > To reproduce this; > {code} > // Creates a table in a PostgreSQL shell > postgres=# CREATE TABLE t (v numeric[], d numeric); > CREATE TABLE > postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555); > INSERT 0 1 > postgres=# SELECT * FROM t; > v |d > -+-- > {.222,.332} | 222.4555 > (1 row) > postgres=# \d t > Table "public.t" > Column | Type| Modifiers > +---+--- > v | numeric[] | > d | numeric | > // Then, reads it in Spark > ./bin/spark-shell --jars=postgresql-42.2.4.jar -v > scala> import java.util.Properties > scala> val options = new Properties(); > scala> options.setProperty("driver", "org.postgresql.Driver") > scala> options.setProperty("user", "maropu") > scala> options.setProperty("password", "") > scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options) > scala> pgTable.printSchema > root > |-- v: array (nullable = true) > ||-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true) > scala> pgTable.show > 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116) > at 
org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) > ... > {code} > I looked over the related code, and I think we need more logic to handle > numeric arrays: > https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41 >
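One possible shape for the missing logic is sketched below. This is a hypothetical illustration, not the actual PostgresDialect change: when the JDBC metadata reports precision 0 for an unconstrained `numeric`, fall back to a sensible default such as Spark's system-default decimal(38, 18) instead of producing an unusable decimal(0, 0) element type.

```java
public class PgNumericMapping {
    static final int DEFAULT_PRECISION = 38; // Spark's system-default decimal precision
    static final int DEFAULT_SCALE = 18;     // and scale

    // Returns {precision, scale} for a PostgreSQL numeric column. A numeric
    // declared without precision/scale modifiers is reported by the driver
    // with precision 0, which is the case the fallback handles.
    public static int[] decimalFor(int precision, int scale) {
        if (precision <= 0) {
            return new int[] { DEFAULT_PRECISION, DEFAULT_SCALE };
        }
        return new int[] { precision, scale };
    }
}
```

Under this mapping the repro's `numeric[]` column would decode as array of decimal(38, 18), and the "Decimal precision 4 exceeds max precision 0" requirement failure would not occur.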
[jira] [Assigned] (SPARK-26546) Caching of DateTimeFormatter
[ https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26546: Assignee: Apache Spark > Caching of DateTimeFormatter > > > Key: SPARK-26546 > URL: https://issues.apache.org/jira/browse/SPARK-26546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently, an instance of java.time.format.DateTimeFormatter is built each time a new Iso8601DateFormatter or Iso8601TimestampFormatter is created, which is a time-consuming operation because the timestamp/date pattern has to be parsed. It could be useful to create a cache with key = (pattern, locale) and value = instance of java.time.format.DateTimeFormatter.
[jira] [Assigned] (SPARK-26546) Caching of DateTimeFormatter
[ https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26546: Assignee: (was: Apache Spark) > Caching of DateTimeFormatter > > > Key: SPARK-26546 > URL: https://issues.apache.org/jira/browse/SPARK-26546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, an instance of java.time.format.DateTimeFormatter is built each time a new Iso8601DateFormatter or Iso8601TimestampFormatter is created, which is a time-consuming operation because the timestamp/date pattern has to be parsed. It could be useful to create a cache with key = (pattern, locale) and value = instance of java.time.format.DateTimeFormatter.
[jira] [Created] (SPARK-26546) Caching of DateTimeFormatter
Maxim Gekk created SPARK-26546: -- Summary: Caching of DateTimeFormatter Key: SPARK-26546 URL: https://issues.apache.org/jira/browse/SPARK-26546 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, an instance of java.time.format.DateTimeFormatter is built each time a new Iso8601DateFormatter or Iso8601TimestampFormatter is created, which is a time-consuming operation because the timestamp/date pattern has to be parsed. It could be useful to create a cache with key = (pattern, locale) and value = instance of java.time.format.DateTimeFormatter.
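The cache the issue proposes could look roughly like this (an illustrative sketch keyed on (pattern, locale); the class and method names are hypothetical, not Spark's eventual implementation):

```java
import java.time.format.DateTimeFormatter;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FormatterCache {
    // key = (pattern, locale), value = the built formatter; computeIfAbsent
    // ensures each pattern is parsed at most once per locale.
    private static final Map<Map.Entry<String, Locale>, DateTimeFormatter> CACHE =
        new ConcurrentHashMap<>();

    public static DateTimeFormatter get(String pattern, Locale locale) {
        return CACHE.computeIfAbsent(
            new SimpleImmutableEntry<>(pattern, locale),
            key -> DateTimeFormatter.ofPattern(key.getKey(), key.getValue()));
    }
}
```

DateTimeFormatter is immutable and thread-safe, so handing the same cached instance to every Iso8601DateFormatter/Iso8601TimestampFormatter is safe.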
[jira] [Assigned] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment
[ https://issues.apache.org/jira/browse/SPARK-26545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26545: Assignee: Apache Spark > Fix typo in EqualNullSafe's truth table comment > --- > > Key: SPARK-26545 > URL: https://issues.apache.org/jira/browse/SPARK-26545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Trivial > > The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results as UNKNOWN.
[jira] [Assigned] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment
[ https://issues.apache.org/jira/browse/SPARK-26545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26545: Assignee: (was: Apache Spark) > Fix typo in EqualNullSafe's truth table comment > --- > > Key: SPARK-26545 > URL: https://issues.apache.org/jira/browse/SPARK-26545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Priority: Trivial > > The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results as UNKNOWN.
[jira] [Created] (SPARK-26545) Fix typo in EqualNullSafe's truth table comment
Kris Mok created SPARK-26545: Summary: Fix typo in EqualNullSafe's truth table comment Key: SPARK-26545 URL: https://issues.apache.org/jira/browse/SPARK-26545 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok The truth table comment in {{EqualNullSafe}} incorrectly marked FALSE results as UNKNOWN.
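For reference, the semantics the corrected comment documents: null-safe equality (`<=>`) is two-valued and never yields UNKNOWN, unlike plain `=`. An illustrative sketch (not Spark's code):

```java
public class NullSafeEquality {
    // <=> truth table: NULL <=> NULL is TRUE, NULL <=> x and x <=> NULL are
    // FALSE (not UNKNOWN), and two non-null values compare as ordinary equality.
    public static boolean eqNullSafe(Object a, Object b) {
        if (a == null && b == null) {
            return true;
        }
        if (a == null || b == null) {
            return false;
        }
        return a.equals(b);
    }
}
```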
[jira] [Updated] (SPARK-26540) Support PostgreSQL numeric arrays without precision/scale
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540: -- Summary: Support PostgreSQL numeric arrays without precision/scale (was: Requirement failed when reading numeric arrays from PostgreSQL) > Support PostgreSQL numeric arrays without precision/scale > - > > Key: SPARK-26540 > URL: https://issues.apache.org/jira/browse/SPARK-26540 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This bug was reported in spark-user: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html > To reproduce this; > {code} > // Creates a table in a PostgreSQL shell > postgres=# CREATE TABLE t (v numeric[], d numeric); > CREATE TABLE > postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555); > INSERT 0 1 > postgres=# SELECT * FROM t; > v |d > -+-- > {.222,.332} | 222.4555 > (1 row) > postgres=# \d t > Table "public.t" > Column | Type| Modifiers > +---+--- > v | numeric[] | > d | numeric | > // Then, reads it in Spark > ./bin/spark-shell --jars=postgresql-42.2.4.jar -v > scala> import java.util.Properties > scala> val options = new Properties(); > scala> options.setProperty("driver", "org.postgresql.Driver") > scala> options.setProperty("user", "maropu") > scala> options.setProperty("password", "") > scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options) > scala> pgTable.printSchema > root > |-- v: array (nullable = true) > ||-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true) > scala> pgTable.show > 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:281) > at 
org.apache.spark.sql.types.Decimal.set(Decimal.scala:116) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) > ... > {code} > I looked over the related code, and I think we need more logic to handle > numeric arrays: > https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41 >
[jira] [Resolved] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26541. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23459 > Add `-Pdocker-integration-tests` to `dev/scalastyle` > > > Key: SPARK-26541 > URL: https://issues.apache.org/jira/browse/SPARK-26541 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > This issue makes `scalastyle` check the `docker-integration-tests` module and fixes one error.
[jira] [Commented] (SPARK-26373) Spark UI 'environment' tab - column to indicate default vs overridden values
[ https://issues.apache.org/jira/browse/SPARK-26373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734814#comment-16734814 ] Pablo Langa Blanco commented on SPARK-26373: Hi [~toopt4], could you explain the utility of it? I'm thinking about it: when you start a Spark application you know which properties you have set (through spark-defaults.conf, SparkConf, or the command line), and all the properties that you don't set are available in the documentation: [https://spark.apache.org/docs/latest/configuration.html] Thanks for the proposal! > Spark UI 'environment' tab - column to indicate default vs overridden values > > > Key: SPARK-26373 > URL: https://issues.apache.org/jira/browse/SPARK-26373 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.4.0 >Reporter: t oo >Priority: Major > > Rather than just showing the name and value of each property, a new column would also show whether the value is the default (show 'AS PER DEFAULT') or whether it is overridden (show the actual default value).
[jira] [Comment Edited] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734810#comment-16734810 ] chenliang edited comment on SPARK-26543 at 1/5/19 8:30 AM: --- [~r...@databricks.com][~markhamstra][~cloud_fan] Could you please take a look at this? Thank you! was (Author: southernriver): [~r...@databricks.com][~markhamstra][~cloud_fan] Could you please help look at this,thank you! > Support the coordinator to determine post-shuffle partitions more reasonably > - > > Key: SPARK-26543 > URL: https://issues.apache.org/jira/browse/SPARK-26543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0 > > Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png > > > For Spark SQL, when we enable adaptive execution with 'set spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to determine the number of post-shuffle partitions. But under certain conditions the coordinator does not perform very well: there are always some tasks retained, and they work with a Shuffle Read Size / Records of 0.0B/0. We could increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this, but that is unreasonable, as targetPostShuffleInputSize should not be set too large. As follows: > !image-2019-01-05-13-18-30-487.png! > We can filter out the useless (0 B) partitions with the ExchangeCoordinator automatically
[jira] [Commented] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734810#comment-16734810 ] chenliang commented on SPARK-26543: --- [~r...@databricks.com][~markhamstra][~cloud_fan] Could you please take a look at this? Thank you! > Support the coordinator to determine post-shuffle partitions more reasonably > - > > Key: SPARK-26543 > URL: https://issues.apache.org/jira/browse/SPARK-26543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0 > > Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png > > > For Spark SQL, when we enable adaptive execution with 'set spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to determine the number of post-shuffle partitions. But under certain conditions the coordinator does not perform very well: there are always some tasks retained, and they work with a Shuffle Read Size / Records of 0.0B/0. We could increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this, but that is unreasonable, as targetPostShuffleInputSize should not be set too large. As follows: > !image-2019-01-05-13-18-30-487.png! > We can filter out the useless (0 B) partitions with the ExchangeCoordinator automatically
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably
[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated SPARK-26543: -- Attachment: SPARK-26543.patch > Support the coordinator to determine post-shuffle partitions more reasonably > - > > Key: SPARK-26543 > URL: https://issues.apache.org/jira/browse/SPARK-26543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: chenliang >Priority: Major > Fix For: 2.3.0 > > Attachments: SPARK-26543.patch, image-2019-01-05-13-18-30-487.png > > > For Spark SQL, when we enable adaptive execution with 'set spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to determine the number of post-shuffle partitions. But under certain conditions the coordinator does not perform very well: there are always some tasks retained, and they work with a Shuffle Read Size / Records of 0.0B/0. We could increase spark.sql.adaptive.shuffle.targetPostShuffleInputSize to solve this, but that is unreasonable, as targetPostShuffleInputSize should not be set too large. As follows: > !image-2019-01-05-13-18-30-487.png! > We can filter out the useless (0 B) partitions with the ExchangeCoordinator automatically
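The reporter's idea can be sketched as follows. This is an illustrative toy, not the actual ExchangeCoordinator: pack map-output partitions into post-shuffle partitions up to a target size, dropping empty (0 B) partitions so no tasks are launched for them.

```java
import java.util.ArrayList;
import java.util.List;

public class PostShuffleCoalescer {
    // Groups map-output partition indices into post-shuffle partitions whose
    // total size stays at or under targetSize. Empty (0 B) partitions are
    // filtered out entirely instead of being kept as idle tasks.
    public static List<List<Integer>> coalesce(long[] sizes, long targetSize) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] == 0) {
                continue; // skip useless 0 B partitions
            }
            if (currentBytes + sizes[i] > targetSize && !current.isEmpty()) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += sizes[i];
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

With sizes {10, 0, 0, 5, 20} and a target of 15 bytes, this yields two post-shuffle partitions, [0, 3] and [4], and the two empty partitions produce no tasks at all.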
[jira] [Commented] (SPARK-26544) Escape strings when serializing map/array to make valid JSON (keep alignment with Hive)
[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734809#comment-16734809 ] Apache Spark commented on SPARK-26544: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/23460 > Escape strings when serializing map/array to make valid JSON (keep > alignment with Hive) > - > > Key: SPARK-26544 > URL: https://issues.apache.org/jira/browse/SPARK-26544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Major > > When reading a Hive table with a map/array type, the string serialized by the Spark Thrift Server is not valid JSON, while Hive's is. > For example, when selecting a field whose type is map, the Spark Thrift Server returns > > {code:java} > {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"} > {code} > > while the Hive Thrift Server returns > > {code:java} > {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"} > {code} >
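The requested escaping can be illustrated with a minimal sketch (hypothetical code, not the actual Thrift server serializer): backslashes and double quotes inside values are escaped so a nested JSON string survives serialization intact.

```java
import java.util.Map;

public class JsonMapSerializer {
    // Escapes the two characters that would otherwise corrupt a JSON string
    // literal: the backslash and the double quote.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '"') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Serializes a string-to-string map as a JSON object, escaping keys and
    // values so nested JSON (like the log_pb field above) stays valid.
    public static String toJson(Map<String, String> m) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : m.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append("}").toString();
    }
}
```

Applied to the example, a value of {"impr_id":"20181231"} serializes as "{\"impr_id\":\"20181231\"}", matching Hive's output.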
[jira] [Assigned] (SPARK-26544) Escape strings when serializing map/array to make valid JSON (keep alignment with Hive)
[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26544: Assignee: Apache Spark > Escape strings when serializing map/array to make valid JSON (keep > alignment with Hive) > - > > Key: SPARK-26544 > URL: https://issues.apache.org/jira/browse/SPARK-26544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Major > > When reading a Hive table with a map/array type, the string serialized by the Spark Thrift Server is not valid JSON, while Hive's is. > For example, when selecting a field whose type is map, the Spark Thrift Server returns > > {code:java} > {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"} > {code} > > while the Hive Thrift Server returns > > {code:java} > {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"} > {code} >
[jira] [Assigned] (SPARK-26544) Escape strings when serializing map/array to make valid JSON (keep alignment with Hive)
[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26544: Assignee: (was: Apache Spark) > Escape strings when serializing map/array to make valid JSON (keep > alignment with Hive) > - > > Key: SPARK-26544 > URL: https://issues.apache.org/jira/browse/SPARK-26544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Major > > When reading a Hive table with a map/array type, the string serialized by the Spark Thrift Server is not valid JSON, while Hive's is. > For example, when selecting a field whose type is map, the Spark Thrift Server returns > > {code:java} > {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"} > {code} > > while the Hive Thrift Server returns > > {code:java} > {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"} > {code} >
[jira] [Commented] (SPARK-26544) Escape strings when serializing map/array to make valid JSON (keep alignment with Hive)
[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734807#comment-16734807 ] Apache Spark commented on SPARK-26544: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/23460 > Escape strings when serializing map/array to make valid JSON (keep > alignment with Hive) > - > > Key: SPARK-26544 > URL: https://issues.apache.org/jira/browse/SPARK-26544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Major > > When reading a Hive table with a map/array type, the string serialized by the Spark Thrift Server is not valid JSON, while Hive's is. > For example, when selecting a field whose type is map, the Spark Thrift Server returns > > {code:java} > {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"} > {code} > > while the Hive Thrift Server returns > > {code:java} > {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"} > {code} >