[jira] [Resolved] (SPARK-30784) Hive 2.3 profile should still use orc-nohive
[ https://issues.apache.org/jira/browse/SPARK-30784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-30784. -- Resolution: Not A Bug Resolving it because with Hive 2.3, using regular orc is required. > Hive 2.3 profile should still use orc-nohive > > > Key: SPARK-30784 > URL: https://issues.apache.org/jira/browse/SPARK-30784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Priority: Critical > > Originally reported at > [https://github.com/apache/spark/pull/26619#issuecomment-583802901] > > Right now, Hive 2.3 profile pulls in regular orc, which depends on > hive-storage-api. However, hive-storage-api and hive-common have the > following common class files > > org/apache/hadoop/hive/common/ValidReadTxnList.class > org/apache/hadoop/hive/common/ValidTxnList.class > org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class > For example, > [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (pulled in by orc 1.5.8) and > [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (from hive-common 2.3.6) both are in the classpath and they are different. > Having both versions in the classpath can cause unexpected behavior due to > classloading order. We should still use orc-nohive, which has > hive-storage-api shaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
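For context on what "orc-nohive" refers to above: the ORC project publishes orc-core both as a regular artifact (which depends on hive-storage-api) and with a "nohive" classifier in which the hive-storage-api classes are shaded. A minimal sketch of the two declarations in sbt syntax, for illustration only (Spark's real build is Maven; the version is the 1.5.8 mentioned in the description):

{code:scala}
// Illustration only, not Spark's actual build definition.

// Regular orc-core: brings hive-storage-api onto the classpath, where its copies of
// ValidTxnList / ValidReadTxnList can collide with the ones shipped in hive-common 2.3.x.
libraryDependencies += "org.apache.orc" % "orc-core" % "1.5.8"

// "nohive" classifier: the hive-storage-api classes are shaded into ORC's own packages,
// so they cannot clash with hive-common, at the cost of diverging from the regular artifact.
libraryDependencies += "org.apache.orc" % "orc-core" % "1.5.8" classifier "nohive"
{code}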
[jira] [Deleted] (SPARK-30976) Improve Maven Install Logic in build/mvn
[ https://issues.apache.org/jira/browse/SPARK-30976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai deleted SPARK-30976: - > Improve Maven Install Logic in build/mvn > > > Key: SPARK-30976 > URL: https://issues.apache.org/jira/browse/SPARK-30976 > Project: Spark > Issue Type: Improvement >Reporter: Wesley Hsiao >Priority: Major > > The current code lacks a validation step to test the installed Maven binary after > download. This is a point of failure: Apache Jenkins jobs can fail when the Maven > binary cannot run due to a corrupted download from an Apache mirror. > To improve the stability of Apache Jenkins builds, a Maven binary test should be > added after the Maven download to verify that the binary works. If it doesn't pass > the test, then download and install from the Apache archive repo. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
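The idea behind the deleted ticket is simple to sketch: run the freshly downloaded Maven binary once before trusting it, and fall back to the Apache archive if it cannot execute. Below is a hypothetical sketch of that check written in Scala for illustration (Spark's actual build/mvn is a shell script; installFromArchive is a made-up placeholder):

{code:scala}
import scala.sys.process._
import scala.util.Try

// Hypothetical sketch of the proposed validation step; not Spark's actual build/mvn logic.
def mavenBinaryWorks(mvnBin: String): Boolean =
  // "mvn --version" exits with 0 only if the downloaded binary is intact and runnable.
  Try(Seq(mvnBin, "--version").!).toOption.contains(0)

def ensureWorkingMaven(mvnBin: String, installFromArchive: () => Unit): Unit =
  if (!mavenBinaryWorks(mvnBin)) {
    // A download corrupted by a flaky mirror fails the check; re-install from the archive repo.
    installFromArchive()
  }
{code}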
[jira] [Updated] (SPARK-30784) Hive 2.3 profile should still use orc-nohive
[ https://issues.apache.org/jira/browse/SPARK-30784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-30784: - Description: Originally reported at [https://github.com/apache/spark/pull/26619#issuecomment-583802901] Right now, Hive 2.3 profile pulls in regular orc, which depends on hive-storage-api. However, hive-storage-api and hive-common have the following common class files org/apache/hadoop/hive/common/ValidReadTxnList.class org/apache/hadoop/hive/common/ValidTxnList.class org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class For example, [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] (pulled in by orc 1.5.8) and [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] (from hive-common 2.3.6) both are in the classpath and they are different. Having both versions in the classpath can cause unexpected behavior due to classloading order. We should still use orc-nohive, which has hive-storage-api shaded. was: Originally reported at [https://github.com/apache/spark/pull/26619#issuecomment-583802901] Right now, Hive 2.3 profile pulls in regular orc, which depends on hive-storage-api. However, hive-storage-api and hive-common have the following common class files {{org/apache/hadoop/hive/common/ValidReadTxnList.class org/apache/hadoop/hive/common/ValidTxnList.class org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class}} For example, [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] (pulled in by orc 1.5.8) and [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] (from hive-common 2.3.6) both are in the classpath and they are different. Having both versions in the classpath can cause unexpected behavior due to classloading order. We should still use orc-nohive, which has hive-storage-api shaded. > Hive 2.3 profile should still use orc-nohive > > > Key: SPARK-30784 > URL: https://issues.apache.org/jira/browse/SPARK-30784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Priority: Blocker > > Originally reported at > [https://github.com/apache/spark/pull/26619#issuecomment-583802901] > > Right now, Hive 2.3 profile pulls in regular orc, which depends on > hive-storage-api. However, hive-storage-api and hive-common have the > following common class files > > org/apache/hadoop/hive/common/ValidReadTxnList.class > org/apache/hadoop/hive/common/ValidTxnList.class > org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class > For example, > [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (pulled in by orc 1.5.8) and > [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (from hive-common 2.3.6) both are in the classpath and they are different. > Having both versions in the classpath can cause unexpected behavior due to > classloading order. We should still use orc-nohive, which has > hive-storage-api shaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30783) Hive 2.3 profile should exclude hive-service-rpc
[ https://issues.apache.org/jira/browse/SPARK-30783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-30783: - Attachment: hive-service-rpc-2.3.6-classes spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364-classes > Hive 2.3 profile should exclude hive-service-rpc > > > Key: SPARK-30783 > URL: https://issues.apache.org/jira/browse/SPARK-30783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Attachments: hive-service-rpc-2.3.6-classes, > spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364-classes > > > hive-service-rpc 2.3.6 and spark sql's thrift server module have duplicate > classes. Leaving hive-service-rpc 2.3.6 in the class path means that spark > can pick up classes defined in hive instead of its thrift server module, > which can cause hard to debug runtime errors due to class loading order and > compilation errors for applications depend on spark. > > If you compare hive-service-rpc 2.3.6's jar > ([https://search.maven.org/remotecontent?filepath=org/apache/hive/hive-service-rpc/2.3.6/hive-service-rpc-2.3.6.jar]) > and spark thrift server's jar (e.g. > [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-hive-thriftserver_2.12/3.0.0-SNAPSHOT/spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364.jar),] > you will see that all of classes provided by hive-service-rpc-2.3.6.jar are > covered by spark thrift server's jar. I am attaching the list of jar contents > for your reference. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30783) Hive 2.3 profile should exclude hive-service-rpc
[ https://issues.apache.org/jira/browse/SPARK-30783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-30783: - Description: hive-service-rpc 2.3.6 and spark sql's thrift server module have duplicate classes. Leaving hive-service-rpc 2.3.6 in the class path means that spark can pick up classes defined in hive instead of its thrift server module, which can cause hard to debug runtime errors due to class loading order and compilation errors for applications depend on spark. If you compare hive-service-rpc 2.3.6's jar ([https://search.maven.org/remotecontent?filepath=org/apache/hive/hive-service-rpc/2.3.6/hive-service-rpc-2.3.6.jar]) and spark thrift server's jar (e.g. [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-hive-thriftserver_2.12/3.0.0-SNAPSHOT/spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364.jar),] you will see that all of classes provided by hive-service-rpc-2.3.6.jar are covered by spark thrift server's jar. I am attaching the list of jar contents for your reference. was:hive-service-rpc 2.3.6 and spark sql's thrift server module have duplicate classes. Leaving hive-service-rpc 2.3.6 in the class path means that spark can pick up classes defined in hive instead of its thrift server module, which can cause hard to debug runtime errors due to class loading order and compilation errors for applications depend on spark. > Hive 2.3 profile should exclude hive-service-rpc > > > Key: SPARK-30783 > URL: https://issues.apache.org/jira/browse/SPARK-30783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > hive-service-rpc 2.3.6 and spark sql's thrift server module have duplicate > classes. Leaving hive-service-rpc 2.3.6 in the class path means that spark > can pick up classes defined in hive instead of its thrift server module, > which can cause hard to debug runtime errors due to class loading order and > compilation errors for applications depend on spark. > > If you compare hive-service-rpc 2.3.6's jar > ([https://search.maven.org/remotecontent?filepath=org/apache/hive/hive-service-rpc/2.3.6/hive-service-rpc-2.3.6.jar]) > and spark thrift server's jar (e.g. > [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-hive-thriftserver_2.12/3.0.0-SNAPSHOT/spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364.jar),] > you will see that all of classes provided by hive-service-rpc-2.3.6.jar are > covered by spark thrift server's jar. I am attaching the list of jar contents > for your reference. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
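The change being proposed amounts to excluding hive-service-rpc from whatever Hive artifact drags it in, so that only the thrift-server module's copies of those classes remain on the classpath. A hedged sketch in sbt syntax (Spark's real build uses Maven exclusions in its poms; the hive-service coordinate below is only meant to show the shape of the change):

{code:scala}
// Illustration only: keep hive-service-rpc off the classpath so the duplicated classes
// come solely from spark-hive-thriftserver.
libraryDependencies += ("org.apache.hive" % "hive-service" % "2.3.6")
  .exclude("org.apache.hive", "hive-service-rpc")
{code}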
[jira] [Created] (SPARK-30784) Hive 2.3 profile should still use orc-nohive
Yin Huai created SPARK-30784: Summary: Hive 2.3 profile should still use orc-nohive Key: SPARK-30784 URL: https://issues.apache.org/jira/browse/SPARK-30784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yin Huai Originally reported at [https://github.com/apache/spark/pull/26619#issuecomment-583802901] Right now, Hive 2.3 profile pulls in regular orc, which depends on hive-storage-api. However, hive-storage-api and hive-common have the following common class files {{org/apache/hadoop/hive/common/ValidReadTxnList.class org/apache/hadoop/hive/common/ValidTxnList.class org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class}} For example, [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] (pulled in by orc 1.5.8) and [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] (from hive-common 2.3.6) both are in the classpath and they are different. Having both versions in the classpath can cause unexpected behavior due to classloading order. We should still use orc-nohive, which has hive-storage-api shaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30783) Hive 2.3 profile should exclude hive-service-rpc
[ https://issues.apache.org/jira/browse/SPARK-30783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-30783: Assignee: Yin Huai > Hive 2.3 profile should exclude hive-service-rpc > > > Key: SPARK-30783 > URL: https://issues.apache.org/jira/browse/SPARK-30783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > hive-service-rpc 2.3.6 and spark sql's thrift server module have duplicate > classes. Leaving hive-service-rpc 2.3.6 in the class path means that spark > can pick up classes defined in hive instead of its thrift server module, > which can cause hard to debug runtime errors due to class loading order and > compilation errors for applications depend on spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30783) Hive 2.3 profile should exclude hive-service-rpc
Yin Huai created SPARK-30783: Summary: Hive 2.3 profile should exclude hive-service-rpc Key: SPARK-30783 URL: https://issues.apache.org/jira/browse/SPARK-30783 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yin Huai hive-service-rpc 2.3.6 and spark sql's thrift server module have duplicate classes. Leaving hive-service-rpc 2.3.6 in the class path means that spark can pick up classes defined in hive instead of its thrift server module, which can cause hard to debug runtime errors due to class loading order and compilation errors for applications depend on spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30450) Exclude .git folder for python linter
[ https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-30450: - Affects Version/s: (was: 2.4.4) 3.0.0 > Exclude .git folder for python linter > - > > Key: SPARK-30450 > URL: https://issues.apache.org/jira/browse/SPARK-30450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Minor > > The python linter shouldn't include the .git folder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30450) Exclude .git folder for python linter
[ https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-30450: - Priority: Minor (was: Major) > Exclude .git folder for python linter > - > > Key: SPARK-30450 > URL: https://issues.apache.org/jira/browse/SPARK-30450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Minor > > The python linter shouldn't include the .git folder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30450) Exclude .git folder for python linter
[ https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-30450: Assignee: Eric Chang > Exclude .git folder for python linter > - > > Key: SPARK-30450 > URL: https://issues.apache.org/jira/browse/SPARK-30450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > > The python linter shouldn't include the .git folder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-25019. -- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.4.0 [https://github.com/apache/spark/pull/22003] has been merged. > The published spark sql pom does not exclude the normal version of orc-core > > > Key: SPARK-25019 > URL: https://issues.apache.org/jira/browse/SPARK-25019 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.4.0 >Reporter: Yin Huai >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 2.4.0 > > > I noticed that > [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] > does not exclude the normal version of orc-core. Comparing with > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] > and > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767,] > we only exclude the normal version of orc-core in the parent pom. So, the > problem is that if a developer depends on spark-sql-core directly, orc-core > and orc-core-nohive will be in the dependency list. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568554#comment-16568554 ] Yin Huai commented on SPARK-25019: -- [~dongjoon] can you help us fix this issue? Or there is a reason that the parent pom and sql/core/pom are not consistent? > The published spark sql pom does not exclude the normal version of orc-core > > > Key: SPARK-25019 > URL: https://issues.apache.org/jira/browse/SPARK-25019 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.4.0 >Reporter: Yin Huai >Priority: Critical > > I noticed that > [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] > does not exclude the normal version of orc-core. Comparing with > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] > and > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767,] > we only exclude the normal version of orc-core in the parent pom. So, the > problem is that if a developer depends on spark-sql-core directly, orc-core > and orc-core-nohive will be in the dependency list. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
Yin Huai created SPARK-25019: Summary: The published spark sql pom does not exclude the normal version of orc-core Key: SPARK-25019 URL: https://issues.apache.org/jira/browse/SPARK-25019 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 2.4.0 Reporter: Yin Huai I noticed that [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] does not exclude the normal version of orc-core. Comparing with [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] and [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767,] we only exclude the normal version of orc-core in the parent pom. So, the problem is that if a developer depends on spark-sql-core directly, orc-core and orc-core-nohive will be in the dependency list. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
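Until the published pom carried the exclusion itself, a downstream build depending on spark-sql directly could work around the duplication along the following lines. This is a hedged sketch in sbt syntax; the ORC version is illustrative, and because exclusions apply to a whole module (classifiers included), the shaded nohive artifact is re-added explicitly:

{code:scala}
// Workaround sketch for a project that depends on spark-sql directly (illustrative versions).
val orcVersion = "1.5.2"  // match whatever ORC version your Spark build actually uses

libraryDependencies += ("org.apache.spark" %% "spark-sql" % "2.4.0-SNAPSHOT")
  .exclude("org.apache.orc", "orc-core")  // drop the regular orc-core leaked by the pom

// Re-add only the shaded "nohive" variant so hive-storage-api is not pulled in twice.
libraryDependencies += "org.apache.orc" % "orc-core" % orcVersion classifier "nohive"
{code}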
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559977#comment-16559977 ] Yin Huai commented on SPARK-24895: -- [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] has some info on it. I am wondering if it requires upgrading both the plugin and maven. We probably need to setup a testing jenkins job to make sure everything works before checking in changes. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554932#comment-16554932 ] Yin Huai commented on SPARK-24895: -- [~hyukjin.kwon] [~kiszk] seems this revert indeed fixed the problem :) > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-24895. -- Resolution: Fixed Fix Version/s: 2.4.0 [https://github.com/apache/spark/pull/21865] has been merged. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553610#comment-16553610 ] Yin Huai commented on SPARK-24895: -- [~kiszk] [~hyukjin.kwon] since this thing is pretty tricky to test it out actually, do you mind if I remove the spotbugs and test out our nightly snapshot build? If this plugin is not the cause, I will add it back. If it is indeed the cause, we can figure out how to fix it. Thanks! > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Priority: Major > > Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553568#comment-16553568 ] Yin Huai commented on SPARK-24895: -- [~kiszk] and [~hyukjin.kwon] we hit this issue today. Per [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21,] it may be related to spot-bug plugin. We are trying to verify it now. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Priority: Major > > Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24895) Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-24895: - Target Version/s: 2.4.0 > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Priority: Major > > Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23310) Perf regression introduced by SPARK-21113
[ https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349679#comment-16349679 ] Yin Huai commented on SPARK-23310: -- [~sitalke...@gmail.com] We found that the commit for SPARK-21113 introduced a noticeable regression. Because Q95 is a join-heavy query, which represents a common class of workloads, I am concerned that this regression is quite easy for users of Spark 2.3 to hit. Considering that setting spark.unsafe.sorter.spill.read.ahead.enabled to false improves the overall performance of all TPC-DS queries, how about we set spark.unsafe.sorter.spill.read.ahead.enabled to false by default in Spark 2.3? Then, we can look into how to resolve this regression for Spark 2.4. What do you think? (Feel free to enable it for your workloads, because such workloads will definitely help Spark improve this part :) ) > Perf regression introduced by SPARK-21113 > - > > Key: SPARK-23310 > URL: https://issues.apache.org/jira/browse/SPARK-23310 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Blocker > > While running all TPC-DS queries with SF set to 1000, we noticed that Q95 > (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql) > has a noticeable regression (11%). After looking into it, we found that the > regression was introduced by SPARK-21113. Specifically, ReadAheadInputStream > gets lock congestion. After setting > spark.unsafe.sorter.spill.read.ahead.enabled to false, the regression > disappeared and the overall performance of all TPC-DS queries improved. > > I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to > false by default for Spark 2.3 and re-enable it after addressing the lock > congestion issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
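For anyone who wants to try the mitigation discussed above on their own workload, the flag can be set when the session (and its SparkContext) is created, or passed with --conf to spark-submit. A minimal sketch:

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch: disable the spill read-ahead buffer introduced by SPARK-21113,
// which is the mitigation proposed above for the Q95-style regression.
val spark = SparkSession.builder()
  .appName("readahead-disabled")
  .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
  .getOrCreate()
{code}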
[jira] [Updated] (SPARK-23310) Perf regression introduced by SPARK-21113
[ https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-23310: - Description: While running all TPC-DS queries with SF set to 1000, we noticed that Q95 (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql) has noticeable regression (11%). After looking into it, we found that the regression was introduced by SPARK-21113. Specially, ReadAheadInputStream gets lock congestion. After setting spark.unsafe.sorter.spill.read.ahead.enabled set to false, the regression disappear and the overall performance of all TPC-DS queries has improved. I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to false by default for Spark 2.3 and re-enable it after addressing the lock congestion issue. was: While running all TPC-DS queries with SF set to 1000, we noticed that Q95 has noticeable regression (11%). After looking into it, we found that the regression was introduced by SPARK-21113. Specially, ReadAheadInputStream gets lock congestion. After setting spark.unsafe.sorter.spill.read.ahead.enabled set to false, the regression disappear and the overall performance of all TPC-DS queries has improved. I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to false by default for Spark 2.3 and re-enable it after addressing the lock congestion issue. > Perf regression introduced by SPARK-21113 > - > > Key: SPARK-23310 > URL: https://issues.apache.org/jira/browse/SPARK-23310 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Blocker > > While running all TPC-DS queries with SF set to 1000, we noticed that Q95 > (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql) > has noticeable regression (11%). After looking into it, we found that the > regression was introduced by SPARK-21113. Specially, ReadAheadInputStream > gets lock congestion. After setting > spark.unsafe.sorter.spill.read.ahead.enabled set to false, the regression > disappear and the overall performance of all TPC-DS queries has improved. > > I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to > false by default for Spark 2.3 and re-enable it after addressing the lock > congestion issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23310) Perf regression introduced by SPARK-21113
Yin Huai created SPARK-23310: Summary: Perf regression introduced by SPARK-21113 Key: SPARK-23310 URL: https://issues.apache.org/jira/browse/SPARK-23310 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Yin Huai While running all TPC-DS queries with SF set to 1000, we noticed that Q95 has noticeable regression (11%). After looking into it, we found that the regression was introduced by SPARK-21113. Specially, ReadAheadInputStream gets lock congestion. After setting spark.unsafe.sorter.spill.read.ahead.enabled set to false, the regression disappear and the overall performance of all TPC-DS queries has improved. I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to false by default for Spark 2.3 and re-enable it after addressing the lock congestion issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349325#comment-16349325 ] Yin Huai commented on SPARK-12297: -- [~zi] has this issue got resolved in Hive? I see HIVE-12767 is still open. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue >Assignee: Imran Rashid >Priority: Major > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. 
> HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
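The work-around this ticket eventually shipped (fix version 2.3.0) is, as far as I recall, exposed as the spark.sql.parquet.int96TimestampConversion option; treat the exact flag name as an assumption if you are on a different version. A brief sketch with a placeholder path:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("int96-readback").getOrCreate()

// Assumed flag name (Spark 2.3+): adjust INT96 timestamps written by Impala so they
// line up with the "floating time" values read back from text/JSON copies of the data.
spark.conf.set("spark.sql.parquet.int96TimestampConversion", "true")

// Placeholder path: a Parquet table whose timestamps were written as INT96.
spark.read.parquet("/path/to/la_parquet").show()
{code}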
[jira] [Updated] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-23292: - Priority: Critical (was: Blocker) > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Critical > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, spark jenkins does not fail because of this issue (see run > history at > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [seems test script will skip tests related to > pandas if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that jenkins does not have pandas installed. > > Since pyarrow related tests have the same skipping logic, we will need to > check if jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349040#comment-16349040 ] Yin Huai commented on SPARK-23292: -- So, jenkins does have the right version of pandas and pyarrow for python 3. There were some difficulties on upgrading pandas and install pyarrow in python 2 (see discussions in [https://github.com/apache/spark/pull/19884).] > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Blocker > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, spark jenkins does not fail because of this issue (see run > history at > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [seems test script will skip tests related to > pandas if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that jenkins does not have pandas installed. > > Since pyarrow related tests have the same skipping logic, we will need to > check if jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23292) python tests related to pandas are skipped
Yin Huai created SPARK-23292: Summary: python tests related to pandas are skipped Key: SPARK-23292 URL: https://issues.apache.org/jira/browse/SPARK-23292 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.0 Reporter: Yin Huai I was running python tests and found that [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] does not run with Python 2 because the test uses "assertRaisesRegex" (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python 2). However, spark jenkins does not fail because of this issue (see run history at [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). After looking into this issue, [seems test script will skip tests related to pandas if pandas is not installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], which means that jenkins does not have pandas installed. Since pyarrow related tests have the same skipping logic, we will need to check if jenkins has pyarrow installed correctly as well. Since features using pandas and pyarrow are in 2.3, we should fix the test issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340544#comment-16340544 ] Yin Huai commented on SPARK-4502: - I think it makes sense to target for 2.4.0. 2.3.1 is a maintenance release. Since this is not a bug fix, it is not suitable for a maintenance release. > Spark SQL reads unneccesary nested fields from Parquet > -- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Liwen Sun >Priority: Critical > > When reading a field of a nested column from Parquet, SparkSQL reads and > assemble all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrades the performance significantly when a nested column has many fields. > For example, I loaded json tweets data into SparkSQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for Tweets schema, > see: https://dev.twitter.com/overview/api/tweets), here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
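The behaviour described in this ticket was eventually addressed by nested schema pruning, which landed after 2.3 behind an experimental flag (assumed here to be spark.sql.optimizer.nestedSchemaPruning.enabled). A sketch of the query from the description, with a placeholder path:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-field-pruning").getOrCreate()

// Assumed flag name: with nested schema pruning enabled, selecting one field of the
// User struct should read only that column from Parquet instead of all 38 fields.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Placeholder path for the tweets dataset from the description.
val tweets = spark.read.parquet("/path/to/tweets_parquet")
tweets.select("User.contributors_enabled").show()
{code}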
[jira] [Commented] (SPARK-22812) Failing cran-check on master
[ https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295532#comment-16295532 ] Yin Huai commented on SPARK-22812: -- Thank you guys! > Failing cran-check on master > - > > Key: SPARK-22812 > URL: https://issues.apache.org/jira/browse/SPARK-22812 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hossein Falaki >Priority: Minor > > When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following > failure message: > {code} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 22] do not match the length of object [0] > {code} > cc [~felixcheung] have you experienced this error before? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21927) Spark pom.xml's dependency management is broken
[ https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154575#comment-16154575 ] Yin Huai commented on SPARK-21927: -- My worry is that it may mask actual issues related to dependencies. For example, the dependency resolvers may pick a version that is not specified in our pom. > Spark pom.xml's dependency management is broken > --- > > Key: SPARK-21927 > URL: https://issues.apache.org/jira/browse/SPARK-21927 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 > Environment: Apache Spark current master (commit > 12ab7f7e89ec9e102859ab3b710815d3058a2e8d) >Reporter: Kris Mok > > When building the current Spark master just now (commit > 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot > of warning messages such as the following. It looks like the dependency > management in the POMs is somehow broken recently. > {code:none} > .../workspace/apache-spark/master (master) $ build/sbt clean package > Attempting to fetch sbt > Launching sbt from build/sbt-launch-0.13.16.jar > [info] Loading project definition from > .../workspace/apache-spark/master/project > [info] Updating > {file:.../workspace/apache-spark/master/project/}master-build... > [info] Resolving org.fusesource.jansi#jansi;1.4 ... > [info] downloading > https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar > ... > [info] [SUCCESSFUL ] > org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms) > [info] downloading > https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar > ... > [info] [SUCCESSFUL ] > org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms) > [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected > to be binary incompatible: > [warn] > [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over > 1.0-beta-6 > [warn] +- org.apache.maven:maven-compat:3.0.4(depends > on 2.2) > [warn] +- org.apache.maven.wagon:wagon-file:2.2 (depends > on 2.2) > [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark > (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2) > [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2 (depends > on 2.2) > [warn] +- org.apache.maven.wagon:wagon-http:2.2 (depends > on 2.2) > [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2 (depends > on 2.2) > [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1 (depends > on 1.0-beta-6) > [warn] > [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, > 2.0.6, 2.1, 1.5.5} > [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2 (depends > on 3.0) > [warn] +- org.apache.maven:maven-compat:3.0.4(depends > on 2.0.6) > [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-artifact:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-core:3.0.4 (depends > on 2.0.6) > [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-embedder:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-settings:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-settings-builder:3.0.4 (depends > on 2.0.6) > [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends > on 2.0.7) > [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1 (depends > on 2.0.7) > [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends > on 2.0.7) > [warn] +- org.apache.maven:maven-model:3.0.4 (depends > on 2.0.7) > [warn] +- org.apache.maven:maven-aether-provider:3.0.4 (depends > on 2.0.7) > [warn] +- org.apache.maven:maven-repository-metadata:3.0.4 (depends > on 2.0.7) > [warn] > [warn] * cglib:cglib is evicted completely > [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends > on 2.2.2) > [warn] > [warn] * asm:asm is evicted completely > [warn] +- cglib:cglib:2.2.2 (depends > on 3.3.1) > [warn] > [warn] Run 'evicted' to see detailed eviction warnings > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
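As a hedged illustration of how eviction conflicts like the ones flagged above are usually pinned down in an sbt build (this fragment is not taken from Spark's build files; the coordinates are simply the ones named in the warning):
{code}
// build.sbt fragment -- dependencyOverrides is a standard sbt setting; pinning a
// version makes the resolver select it explicitly instead of relying on eviction.
dependencyOverrides += "org.codehaus.plexus" % "plexus-utils" % "3.0"
dependencyOverrides += "org.apache.maven.wagon" % "wagon-provider-api" % "2.2"
{code}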
[jira] [Updated] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-21258: - Fix Version/s: (was: 2.1.2) > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21258) Window result incorrect using complex object with spilling
[ https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075055#comment-16075055 ] Yin Huai commented on SPARK-21258: -- Since this change is not in branch-2.1, I am removing 2.1.2 from the list of fix versions. > Window result incorrect using complex object with spilling > -- > > Key: SPARK-21258 > URL: https://issues.apache.org/jira/browse/SPARK-21258 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21111) Fix test failure in 2.2
[ https://issues.apache.org/jira/browse/SPARK-21111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-21111. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18316 [https://github.com/apache/spark/pull/18316] > Fix test failure in 2.2 > > > Key: SPARK-21111 > URL: https://issues.apache.org/jira/browse/SPARK-21111 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.2.0 > > > Test failure: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.2-test-sbt-hadoop-2.7/203/ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-20311: -- > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003626#comment-16003626 ] Yin Huai commented on SPARK-20311: -- It introduced a regression (https://github.com/apache/spark/pull/17666#issuecomment-300309896). I have reverted the change. > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-20311: - Fix Version/s: (was: 2.2.1) (was: 2.3.0) > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-20661: Assignee: Hossein Falaki > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-20661. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17903 [https://github.com/apache/spark/pull/17903] > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks
[ https://issues.apache.org/jira/browse/SPARK-20358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-20358: Assignee: Eric Liang > Executors failing stage on interrupted exception thrown by cancelled tasks > -- > > Key: SPARK-20358 > URL: https://issues.apache.org/jira/browse/SPARK-20358 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.2.0 > > > https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression > where now interrupted exceptions will cause a task to fail on cancellation. > This is because NonFatal(e) does not match if e is an InterruptedException. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks
[ https://issues.apache.org/jira/browse/SPARK-20358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-20358. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17659 [https://github.com/apache/spark/pull/17659] > Executors failing stage on interrupted exception thrown by cancelled tasks > -- > > Key: SPARK-20358 > URL: https://issues.apache.org/jira/browse/SPARK-20358 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Eric Liang > Fix For: 2.2.0 > > > https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression > where now interrupted exceptions will cause a task to fail on cancellation. > This is because NonFatal(e) does not match if e is an InterruptedException. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20217) Executor should not fail stage if killed task throws non-interrupted exception
[ https://issues.apache.org/jira/browse/SPARK-20217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-20217: Assignee: Eric Liang > Executor should not fail stage if killed task throws non-interrupted exception > -- > > Key: SPARK-20217 > URL: https://issues.apache.org/jira/browse/SPARK-20217 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.2.0 > > > This is reproducible as follows. Run the following, and then use > SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will > fail since we threw a RuntimeException instead of InterruptedException. > We should probably unconditionally return TaskKilled instead of TaskFailed if > the task was killed by the driver, regardless of the actual exception thrown. > {code} > spark.range(100).repartition(100).foreach { i => > try { > Thread.sleep(1000) > } catch { > case t: InterruptedException => > throw new RuntimeException(t) > } > } > {code} > Based on the code in TaskSetManager, I think this also affects kills of > speculative tasks. However, since the number of speculated tasks is few, and > usually you need to fail a task a few times before the stage is cancelled, > probably no-one noticed this in production. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20217) Executor should not fail stage if killed task throws non-interrupted exception
[ https://issues.apache.org/jira/browse/SPARK-20217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-20217. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17531 [https://github.com/apache/spark/pull/17531] > Executor should not fail stage if killed task throws non-interrupted exception > -- > > Key: SPARK-20217 > URL: https://issues.apache.org/jira/browse/SPARK-20217 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Eric Liang > Fix For: 2.2.0 > > > This is reproducible as follows. Run the following, and then use > SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will > fail since we threw a RuntimeException instead of InterruptedException. > We should probably unconditionally return TaskKilled instead of TaskFailed if > the task was killed by the driver, regardless of the actual exception thrown. > {code} > spark.range(100).repartition(100).foreach { i => > try { > Thread.sleep(1000) > } catch { > case t: InterruptedException => > throw new RuntimeException(t) > } > } > {code} > Based on the code in TaskSetManager, I think this also affects kills of > speculative tasks. However, since the number of speculated tasks is few, and > usually you need to fail a task a few times before the stage is cancelled, > probably no-one noticed this in production. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
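The user-side counterpart of the description above is simply to let the InterruptedException propagate instead of wrapping it. The snippet below is a spark-shell style sketch (it assumes the shell's predefined {{spark}} session), not the change made in the pull request:
{code}
spark.range(100).repartition(100).foreach { _ =>
  try {
    Thread.sleep(1000)
  } catch {
    case t: InterruptedException =>
      throw t  // do not wrap in RuntimeException; a killed task stays "killed", not "failed"
  }
}
{code}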
[jira] [Commented] (SPARK-14388) Create Table
[ https://issues.apache.org/jira/browse/SPARK-14388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15932960#comment-15932960 ] Yin Huai commented on SPARK-14388: -- [~erlu] I see. Can you create a jira for this? Let's put an example in the description of that jira to explain the problem. Also, it will be great if you want to submit a pr to make the change :) > Create Table > > > Key: SPARK-14388 > URL: https://issues.apache.org/jira/browse/SPARK-14388 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Andrew Or > Fix For: 2.0.0 > > > For now, we still ask Hive to handle creating hive tables. We should handle > them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
[ https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-19620: Assignee: Carson Wang > Incorrect exchange coordinator Id in physical plan > -- > > Key: SPARK-19620 > URL: https://issues.apache.org/jira/browse/SPARK-19620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Minor > Fix For: 2.2.0 > > > When adaptive execution is enabled, an exchange coordinator is used in the > Exchange operators. For Join, the same exchange coordinator is used for its > two Exchanges. But the physical plan shows two different coordinator Ids, > which is confusing. > Here is an example: > {code} > == Physical Plan == > *Project [key1#3L, value2#12L] > +- *SortMergeJoin [key1#3L], [key2#11L], Inner >:- *Sort [key1#3L ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), > coordinator[target post-shuffle partition size: 67108864] >: +- *Project [(id#0L % 500) AS key1#3L] >:+- *Filter isnotnull((id#0L % 500)) >: +- *Range (0, 1000, step=1, splits=Some(10)) >+- *Sort [key2#11L ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), > coordinator[target post-shuffle partition size: 67108864] > +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L] > +- *Filter isnotnull((id#8L % 500)) >+- *Range (0, 1000, step=1, splits=Some(10)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
[ https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-19620. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16952 [https://github.com/apache/spark/pull/16952] > Incorrect exchange coordinator Id in physical plan > -- > > Key: SPARK-19620 > URL: https://issues.apache.org/jira/browse/SPARK-19620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Carson Wang >Priority: Minor > Fix For: 2.2.0 > > > When adaptive execution is enabled, an exchange coordinator is used in the > Exchange operators. For Join, the same exchange coordinator is used for its > two Exchanges. But the physical plan shows two different coordinator Ids, > which is confusing. > Here is an example: > {code} > == Physical Plan == > *Project [key1#3L, value2#12L] > +- *SortMergeJoin [key1#3L], [key2#11L], Inner >:- *Sort [key1#3L ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), > coordinator[target post-shuffle partition size: 67108864] >: +- *Project [(id#0L % 500) AS key1#3L] >:+- *Filter isnotnull((id#0L % 500)) >: +- *Range (0, 1000, step=1, splits=Some(10)) >+- *Sort [key2#11L ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), > coordinator[target post-shuffle partition size: 67108864] > +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L] > +- *Filter isnotnull((id#8L % 500)) >+- *Range (0, 1000, step=1, splits=Some(10)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level
[ https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-19816: - Fix Version/s: 2.1.1 > DataFrameCallbackSuite doesn't recover the log level > > > Key: SPARK-19816 > URL: https://issues.apache.org/jira/browse/SPARK-19816 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > "DataFrameCallbackSuite.execute callback functions when a DataFrame action > failed" sets the log level to "fatal" but doesn't recover it. Hence, tests > running after it won't output any logs except fatal logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
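A minimal sketch of the "restore the log level afterwards" pattern the description calls for, assuming a log4j 1.x backend (an assumption; the ticket does not name the logging API):
{code}
import org.apache.log4j.{Level, Logger}

object LogLevelUtil {
  // Run `body` with the root logger at `level`, then put back whatever was set before.
  def withLogLevel[T](level: Level)(body: => T): T = {
    val root = Logger.getRootLogger
    val saved = root.getLevel
    root.setLevel(level)
    try body finally root.setLevel(saved)
  }
}
// usage: LogLevelUtil.withLogLevel(Level.FATAL) { /* run the assertion that needs quiet logs */ }
{code}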
[jira] [Commented] (SPARK-19604) Log the start of every Python test
[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869023#comment-15869023 ] Yin Huai commented on SPARK-19604: -- It has been resolved by https://github.com/apache/spark/pull/16935. > Log the start of every Python test > -- > > Key: SPARK-19604 > URL: https://issues.apache.org/jira/browse/SPARK-19604 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.3, 2.1.1 > > > Right now, we only have info level log after we finish the tests of a Python > test file. We should also log the start of a test. So, if a test is hanging, > we can tell which test file is running. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19604) Log the start of every Python test
[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-19604. -- Resolution: Fixed Fix Version/s: 2.1.1 2.0.3 > Log the start of every Python test > -- > > Key: SPARK-19604 > URL: https://issues.apache.org/jira/browse/SPARK-19604 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.3, 2.1.1 > > > Right now, we only have info level log after we finish the tests of a Python > test file. We should also log the start of a test. So, if a test is hanging, > we can tell which test file is running. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19604) Log the start of every Python test
[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-19604: Assignee: Yin Huai > Log the start of every Python test > -- > > Key: SPARK-19604 > URL: https://issues.apache.org/jira/browse/SPARK-19604 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we only have info level log after we finish the tests of a Python > test file. We should also log the start of a test. So, if a test is hanging, > we can tell which test file is running. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19604) Log the start of every Python test
Yin Huai created SPARK-19604: Summary: Log the start of every Python test Key: SPARK-19604 URL: https://issues.apache.org/jira/browse/SPARK-19604 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.1.0 Reporter: Yin Huai Right now, we only have info level log after we finish the tests of a Python test file. We should also log the start of a test. So, if a test is hanging, we can tell which test file is running. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19321) Support Hive 2.x's metastore
Yin Huai created SPARK-19321: Summary: Support Hive 2.x's metastore Key: SPARK-19321 URL: https://issues.apache.org/jira/browse/SPARK-19321 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai It will be good to make Spark work with Hive 2.x's metastores. We need to add the needed shim classes in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala, make IsolatedClientLoader recognize the new metastore versions (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala), and finally add tests in https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
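For reference, once the shims exist this is roughly how a user would point Spark at a newer metastore. The sketch below is spark-shell style and uses the documented {{spark.sql.hive.metastore.version}} and {{spark.sql.hive.metastore.jars}} keys; the version string is only illustrative.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.0")   // illustrative Hive 2.x version
  .config("spark.sql.hive.metastore.jars", "maven")      // download client jars (see SPARK-19295 below)
  .enableHiveSupport()
  .getOrCreate()
{code}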
[jira] [Resolved] (SPARK-19295) IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars
[ https://issues.apache.org/jira/browse/SPARK-19295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-19295. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16649 [https://github.com/apache/spark/pull/16649] > IsolatedClientLoader's downloadVersion should log the location of downloaded > metastore client jars > -- > > Key: SPARK-19295 > URL: https://issues.apache.org/jira/browse/SPARK-19295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > Fix For: 2.2.0 > > > When you set {{spark.sql.hive.metastore.jars}} to {{maven}}, spark will > download metastore client jars and their dependencies. It will be good to log > the location of those downloaded jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19295) IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars
[ https://issues.apache.org/jira/browse/SPARK-19295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-19295: - Priority: Minor (was: Major) > IsolatedClientLoader's downloadVersion should log the location of downloaded > metastore client jars > -- > > Key: SPARK-19295 > URL: https://issues.apache.org/jira/browse/SPARK-19295 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > When you set {{spark.sql.hive.metastore.jars}} to {{maven}}, spark will > download metastore client jars and their dependencies. It will be good to log > the location of those downloaded jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19295) IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars
[ https://issues.apache.org/jira/browse/SPARK-19295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-19295: - Issue Type: Improvement (was: Bug) > IsolatedClientLoader's downloadVersion should log the location of downloaded > metastore client jars > -- > > Key: SPARK-19295 > URL: https://issues.apache.org/jira/browse/SPARK-19295 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > When you set {{spark.sql.hive.metastore.jars}} to {{maven}}, spark will > download metastore client jars and their dependencies. It will be good to log > the location of those downloaded jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19295) IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars
Yin Huai created SPARK-19295: Summary: IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars Key: SPARK-19295 URL: https://issues.apache.org/jira/browse/SPARK-19295 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai When you set {{spark.sql.hive.metastore.jars}} to {{maven}}, spark will download metastore client jars and their dependencies. It will be good to log the location of those downloaded jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18885) unify CREATE TABLE syntax for data source and hive serde tables
[ https://issues.apache.org/jira/browse/SPARK-18885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18885. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16296 [https://github.com/apache/spark/pull/16296] > unify CREATE TABLE syntax for data source and hive serde tables > --- > > Key: SPARK-18885 > URL: https://issues.apache.org/jira/browse/SPARK-18885 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > Attachments: CREATE-TABLE.pdf > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19072) Catalyst's IN always returns false for infinity
[ https://issues.apache.org/jira/browse/SPARK-19072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-19072. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16469 [https://github.com/apache/spark/pull/16469] > Catalyst's IN always returns false for infinity > --- > > Key: SPARK-19072 > URL: https://issues.apache.org/jira/browse/SPARK-19072 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Kay Ousterhout >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > This bug was caused by the fix for SPARK-18999 > (https://github.com/apache/spark/pull/16402) > This can be reproduced by adding the following test to PredicateSuite.scala > (which will consistently fail): > val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType) > checkEvaluation(In(value, List(value)), true) > This bug is causing > org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN to fail > approximately 10% of the time (it fails anytime the value is Infinity or > -Infinity and the correct answer is True -- e.g., > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/, > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
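A quick spark-shell style check of the semantics the test above asserts (this query is not part of the ticket; with the fix in place it should return true, since an infinite value must be found by IN when it is in the list):
{code}
spark.sql("SELECT CAST('Infinity' AS DOUBLE) IN (CAST('Infinity' AS DOUBLE)) AS found").show()
{code}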
[jira] [Updated] (SPARK-19072) Catalyst's IN always returns false for infinity
[ https://issues.apache.org/jira/browse/SPARK-19072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-19072: - Assignee: Wenchen Fan > Catalyst's IN always returns false for infinity > --- > > Key: SPARK-19072 > URL: https://issues.apache.org/jira/browse/SPARK-19072 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Kay Ousterhout >Assignee: Wenchen Fan > > This bug was caused by the fix for SPARK-18999 > (https://github.com/apache/spark/pull/16402) > This can be reproduced by adding the following test to PredicateSuite.scala > (which will consistently fail): > val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType) > checkEvaluation(In(value, List(value)), true) > This bug is causing > org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN to fail > approximately 10% of the time (it fails anytime the value is Infinity or > -Infinity and the correct answer is True -- e.g., > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/, > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18567) Simplify CreateDataSourceTableAsSelectCommand
[ https://issues.apache.org/jira/browse/SPARK-18567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18567. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15996 [https://github.com/apache/spark/pull/15996] > Simplify CreateDataSourceTableAsSelectCommand > - > > Key: SPARK-18567 > URL: https://issues.apache.org/jira/browse/SPARK-18567 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16552) Store the Inferred Schemas into External Catalog Tables when Creating Tables
[ https://issues.apache.org/jira/browse/SPARK-16552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783705#comment-15783705 ] Yin Huai commented on SPARK-16552: -- [~smilegator] [~cloud_fan] i think we will not do partitioning discovery after SPARK-17861 by default right? Can you help me check if we still need to write anything about this in the release notes? > Store the Inferred Schemas into External Catalog Tables when Creating Tables > > > Key: SPARK-16552 > URL: https://issues.apache.org/jira/browse/SPARK-16552 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > > Currently, in Spark SQL, the initial creation of schema can be classified > into two groups. It is applicable to both Hive tables and Data Source tables: > Group A. Users specify the schema. > Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema > of the SELECT clause. For example, > {noformat} > CREATE TABLE tab STORED AS TEXTFILE > AS SELECT * from input > {noformat} > Case 2 CREATE TABLE: users explicitly specify the schema. For example, > {noformat} > CREATE TABLE jsonTable (_1 string, _2 string) > USING org.apache.spark.sql.json > {noformat} > Group B. Spark SQL infer the schema at runtime. > Case 3 CREATE TABLE. Users do not specify the schema but the path to the file > location. For example, > {noformat} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json > OPTIONS (path '${tempDir.getCanonicalPath}') > {noformat} > Now, Spark SQL does not store the inferred schema in the external catalog for > the cases in Group B. When users refreshing the metadata cache, accessing the > table at the first time after (re-)starting Spark, Spark SQL will infer the > schema and store the info in the metadata cache for improving the performance > of subsequent metadata requests. However, the runtime schema inference could > cause undesirable schema changes after each reboot of Spark. > It is desirable to store the inferred schema in the external catalog when > creating the table. When users intend to refresh the schema, they issue > `REFRESH TABLE`. Spark SQL will infer the schema again based on the > previously specified table location and update/refresh the schema in the > external catalog and metadata cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18990) make DatasetBenchmark fairer for Dataset
[ https://issues.apache.org/jira/browse/SPARK-18990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18990: - Fix Version/s: (was: 2.2.0) > make DatasetBenchmark fairer for Dataset > > > Key: SPARK-18990 > URL: https://issues.apache.org/jira/browse/SPARK-18990 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18990) make DatasetBenchmark fairer for Dataset
[ https://issues.apache.org/jira/browse/SPARK-18990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-18990: -- > make DatasetBenchmark fairer for Dataset > > > Key: SPARK-18990 > URL: https://issues.apache.org/jira/browse/SPARK-18990 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6
[ https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18951. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16359 [https://github.com/apache/spark/pull/16359] > Upgrade com.thoughtworks.paranamer/paranamer to 2.6 > --- > > Key: SPARK-18951 > URL: https://issues.apache.org/jira/browse/SPARK-18951 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.2.0 > > > I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes > jackson to fail to handle a byte array defined in a case class. Then I found > https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests > that it is caused by a bug in paranamer. Let's upgrade paranamer. > Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 uses > com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade > paranamer to 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
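The kind of code the description refers to, a case class with a byte-array constructor parameter going through jackson-module-scala (which resolves constructor-parameter names via paranamer), looks roughly like the sketch below; the class and field names are made up:
{code}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical names; the relevant part is the Array[Byte] constructor parameter.
case class Record(name: String, payload: Array[Byte])

object ParanamerCheck {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    val json = mapper.writeValueAsString(Record("a", Array[Byte](1, 2, 3)))
    // With paranamer 2.6 on the classpath this round-trips; with the older version
    // the constructor-parameter name lookup could fail.
    val back = mapper.readValue(json, classOf[Record])
    println(back.name + " " + back.payload.length)
  }
}
{code}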
[jira] [Updated] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
[ https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18928: - Fix Version/s: 2.0.3 > FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation > --- > > Key: SPARK-18928 > URL: https://issues.apache.org/jira/browse/SPARK-18928 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Spark tasks respond to cancellation by checking > {{TaskContext.isInterrupted()}}, but this check is missing on a few critical > paths used in Spark SQL, including FileScanRDD, JDBCRDD, and > UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to > continue running and become zombies. > Here's an example: first, create a giant text file. In my case, I just > concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig > file. Then, run a really slow query over that file and try to cancel it: > {code} > spark.read.text("/tmp/words").selectExpr("value + value + value").collect() > {code} > This will sit and churn at 100% CPU for a minute or two because the task > isn't checking the interrupted flag. > The solution here is to add InterruptedIterator-style checks to a few > locations where they're currently missing in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
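A minimal sketch (not Spark's internal implementation) of the interruption-checking iterator pattern the description refers to; {{TaskContext.isInterrupted()}} is the public kill flag, and the plain RuntimeException here stands in for the exception Spark actually raises:
{code}
import org.apache.spark.TaskContext

// Wraps a row iterator and checks the task's kill flag on every hasNext call, so a
// cancelled task stops scanning instead of running on as a zombie.
class CancellationCheckingIterator[T](context: TaskContext, underlying: Iterator[T])
  extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context.isInterrupted()) {
      throw new RuntimeException("task " + context.taskAttemptId() + " was cancelled")
    }
    underlying.hasNext
  }
  override def next(): T = underlying.next()
}
{code}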
[jira] [Updated] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
[ https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18761: - Fix Version/s: 2.1.1 2.0.3 > Uncancellable / unkillable tasks may starve jobs of resources > > > Key: SPARK-18761 > URL: https://issues.apache.org/jira/browse/SPARK-18761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Spark's current task cancellation / task killing mechanism is "best effort" > in the sense that some tasks may not be interruptible and may not respond to > their "killed" flags being set. If a significant fraction of a cluster's task > slots are occupied by tasks that have been marked as killed but remain > running then this can lead to a situation where new jobs and tasks are > starved of resources because zombie tasks are holding resources. > I propose to address this problem by introducing a "task reaper" mechanism in > executors to monitor tasks after they are marked for killing in order to > periodically re-attempt the task kill, capture and log stacktraces / warnings > if tasks do not exit in a timely manner, and, optionally, kill the entire > executor JVM if cancelled tasks cannot be killed within some timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18953) Do not show the link to a dead worker on the master page
Yin Huai created SPARK-18953: Summary: Do not show the link to a dead worker on the master page Key: SPARK-18953 URL: https://issues.apache.org/jira/browse/SPARK-18953 Project: Spark Issue Type: Bug Components: Web UI Reporter: Yin Huai The master page still seems to show links to dead workers. For a dead worker, we will not be able to see its worker page anyway, so it makes sense not to show links to dead workers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6
[ https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18951: - Description: I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes jackson fail to handle byte array defined in a case class. Then I find https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use com.thoughtworks.paranamer/paranamer 2.6, I suggests that we upgrade paranamer to 2.6. was: I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes jackson fail to handle byte array defined in a case class. Then I find https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use com.thoughtworks.paranamer/paranamer uses 2.6, I suggests that we upgrade paranamer to 2.6. > Upgrade com.thoughtworks.paranamer/paranamer to 2.6 > --- > > Key: SPARK-18951 > URL: https://issues.apache.org/jira/browse/SPARK-18951 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Yin Huai > > I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes > jackson fail to handle byte array defined in a case class. Then I find > https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests > that it is caused by a bug in paranamer. Let's upgrade paranamer. > Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use > com.thoughtworks.paranamer/paranamer 2.6, I suggests that we upgrade > paranamer to 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6
[ https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-18951: Assignee: Yin Huai > Upgrade com.thoughtworks.paranamer/paranamer to 2.6 > --- > > Key: SPARK-18951 > URL: https://issues.apache.org/jira/browse/SPARK-18951 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Yin Huai > > I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes > jackson fail to handle byte array defined in a case class. Then I find > https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests > that it is caused by a bug in paranamer. Let's upgrade paranamer. > Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use > com.thoughtworks.paranamer/paranamer uses 2.6, I suggests that we upgrade > paranamer to 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer
Yin Huai created SPARK-18951: Summary: Upgrade com.thoughtworks.paranamer/paranamer Key: SPARK-18951 URL: https://issues.apache.org/jira/browse/SPARK-18951 Project: Spark Issue Type: Bug Components: Build Reporter: Yin Huai I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes jackson to fail to handle a byte array defined in a case class. Then I found https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 uses com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade paranamer to 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6
[ https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18951: - Summary: Upgrade com.thoughtworks.paranamer/paranamer to 2.6 (was: Upgrade com.thoughtworks.paranamer/paranamer) > Upgrade com.thoughtworks.paranamer/paranamer to 2.6 > --- > > Key: SPARK-18951 > URL: https://issues.apache.org/jira/browse/SPARK-18951 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai > > I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes > jackson fail to handle byte array defined in a case class. Then I find > https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests > that it is caused by a bug in paranamer. Let's upgrade paranamer. > Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use > com.thoughtworks.paranamer/paranamer uses 2.6, I suggests that we upgrade > paranamer to 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
[ https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18761. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16189 [https://github.com/apache/spark/pull/16189] > Uncancellable / unkillable tasks may starve jobs of resources > > > Key: SPARK-18761 > URL: https://issues.apache.org/jira/browse/SPARK-18761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.2.0 > > > Spark's current task cancellation / task killing mechanism is "best effort" > in the sense that some tasks may not be interruptible and may not respond to > their "killed" flags being set. If a significant fraction of a cluster's task > slots are occupied by tasks that have been marked as killed but remain > running then this can lead to a situation where new jobs and tasks are > starved of resources because zombie tasks are holding resources. > I propose to address this problem by introducing a "task reaper" mechanism in > executors to monitor tasks after they are marked for killing in order to > periodically re-attempt the task kill, capture and log stacktraces / warnings > if tasks do not exit in a timely manner, and, optionally, kill the entire > executor JVM if cancelled tasks cannot be killed within some timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
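The reaper described above is opt-in. As a hedged illustration, the {{spark.task.reaper.*}} keys documented for this mechanism can be set like this (the values are only examples):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.reaper.enabled", "true")          // monitor tasks after they are marked killed
  .set("spark.task.reaper.pollingInterval", "10s")   // how often to re-attempt the kill / log stacks
  .set("spark.task.reaper.killTimeout", "120s")      // last resort: kill the executor JVM
{code}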
[jira] [Resolved] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase
[ https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18921. -- Resolution: Fixed Fix Version/s: 2.1.1 Issue resolved by pull request 16332 [https://github.com/apache/spark/pull/16332] > check database existence with Hive.databaseExists instead of getDatabase > > > Key: SPARK-18921 > URL: https://issues.apache.org/jira/browse/SPARK-18921 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 2.1.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-13747. -- Resolution: Fixed Fix Version/s: (was: 2.0.2) (was: 2.1.0) 2.2.0 Issue resolved by pull request 16230 [https://github.com/apache/spark/pull/16230] > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties have been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
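One way to avoid the problem before this fix was to drive the concurrent actions from a dedicated fixed-size thread pool instead of the global ForkJoinPool, so SparkContext.runJob is never suspended and resumed on a thread whose local properties belong to another query. The snippet below is a spark-shell style sketch (it assumes the shell's predefined {{sc}} and implicits; the pool size is arbitrary), not something from the ticket:
{code}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// A plain fixed thread pool does not block-and-steal like the global ForkJoinPool,
// so the per-thread execution-id property is not clobbered mid-job.
implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
val counts = (1 to 100).map { _ =>
  Future { sc.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count() }
}
counts.foreach(f => println(Await.result(f, Duration.Inf)))
{code}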
[jira] [Resolved] (SPARK-18675) CTAS for hive serde table should work for all hive versions
[ https://issues.apache.org/jira/browse/SPARK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18675. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16104 [https://github.com/apache/spark/pull/16104] > CTAS for hive serde table should work for all hive versions > --- > > Key: SPARK-18675 > URL: https://issues.apache.org/jira/browse/SPARK-18675 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18816: - Assignee: Alex Bozarth > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai >Assignee: Alex Bozarth >Priority: Blocker > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page (please see > https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743298#comment-15743298 ] Yin Huai commented on SPARK-18816: -- Yea, log pages are still there. But, without those links on the executor page, it is very hard to find those pages. btw, is there any place that we should look at to find the cause of this problem? > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai >Priority: Blocker > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page (please see > https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18816: - Priority: Blocker (was: Major) > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai >Priority: Blocker > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page (please see > https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743260#comment-15743260 ] Yin Huai commented on SPARK-18816: -- [~ajbozarth] Yea, please take a look. Thanks! The reasons that I set it as a blocker are (1) those log links are super important for debugging; and (2) it is a regression from 2.0. > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page (please see > https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15737254#comment-15737254 ] Yin Huai commented on SPARK-18816: -- btw, my testing was done with chrome. I then terminated the cluster and started a new one. I first launched workers. Then, I still could not see the log links on the page. But, I can see the links from safari. > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai >Priority: Blocker > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page (please see > https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18816: - Attachment: screenshot-1.png > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai >Priority: Blocker > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
Yin Huai created SPARK-18816: Summary: executor page fails to show log links if executors are added after an app is launched Key: SPARK-18816 URL: https://issues.apache.org/jira/browse/SPARK-18816 Project: Spark Issue Type: Bug Components: Web UI Reporter: Yin Huai Priority: Blocker Attachments: screenshot-1.png How to reproduce with standalone mode: 1. Launch a spark master 2. Launch a spark shell. At this point, there is no executor associated with this application. 3. Launch a slave. Now, there is an executor assigned to the spark shell. However, there is no link to stdout/stderr on the executor page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched
[ https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18816: - Description: How to reproduce with standalone mode: 1. Launch a spark master 2. Launch a spark shell. At this point, there is no executor associated with this application. 3. Launch a slave. Now, there is an executor assigned to the spark shell. However, there is no link to stdout/stderr on the executor page (please see https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). was: How to reproduce with standalone mode: 1. Launch a spark master 2. Launch a spark shell. At this point, there is no executor associated with this application. 3. Launch a slave. Now, there is an executor assigned to the spark shell. However, there is no link to stdout/stderr on the executor page. > executor page fails to show log links if executors are added after an app is > launched > - > > Key: SPARK-18816 > URL: https://issues.apache.org/jira/browse/SPARK-18816 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Yin Huai >Priority: Blocker > Attachments: screenshot-1.png > > > How to reproduce with standalone mode: > 1. Launch a spark master > 2. Launch a spark shell. At this point, there is no executor associated with > this application. > 3. Launch a slave. Now, there is an executor assigned to the spark shell. > However, there is no link to stdout/stderr on the executor page (please see > https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18284) Schema of DataFrame generated from RDD is different between master and 2.0
[ https://issues.apache.org/jira/browse/SPARK-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723012#comment-15723012 ] Yin Huai commented on SPARK-18284: -- [~kiszk] btw, do we know what caused the nullable setting change in 2.1? > Schema of DataFrame generated from RDD is different between master and 2.0 > - > > Key: SPARK-18284 > URL: https://issues.apache.org/jira/browse/SPARK-18284 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > When the following program is executed, the schema of the DataFrame differs > among master, branch 2.0, and branch 2.1. The nullable flag should be false. > {code:java} > val df = sparkContext.parallelize(1 to 8, 1).toDF() > df.printSchema > df.filter("value > 4").count > === master === > root > |-- value: integer (nullable = true) > === branch 2.1 === > root > |-- value: integer (nullable = true) > === branch 2.0 === > root > |-- value: integer (nullable = false) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
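As a quick reference for the expected behavior, a small check sketch (assumes a Spark 2.x shell with {{spark}} and its implicits in scope): the values come from a local range of Scala Ints and can never be null, so the inferred column should be reported as non-nullable.
{code}
// Sketch of the expected behavior (assumes a Spark 2.x shell; spark and
// spark.implicits._ are in scope): Ints from a local range cannot be null,
// so the inferred schema should mark the column as nullable = false.
val df = spark.sparkContext.parallelize(1 to 8, 1).toDF()
val field = df.schema("value")
println(s"${field.name}: ${field.dataType}, nullable = ${field.nullable}")
// expected: value: IntegerType, nullable = false
{code}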
[jira] [Created] (SPARK-18660) Parquet complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Yin Huai created SPARK-18660: Summary: Parquet complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl " Key: SPARK-18660 URL: https://issues.apache.org/jira/browse/SPARK-18660 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai The Parquet record reader always complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl". Looks like we always create TaskAttemptContextImpl (https://github.com/apache/spark/blob/2f7461f31331cfc37f6cfa3586b7bbefb3af5547/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L368). But Parquet wants a TaskInputOutputContext, and the TaskAttemptContextImpl we create is not an instance of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18631) Avoid making data skew worse in ExchangeCoordinator
[ https://issues.apache.org/jira/browse/SPARK-18631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18631. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16065 [https://github.com/apache/spark/pull/16065] > Avoid making data skew worse in ExchangeCoordinator > --- > > Key: SPARK-18631 > URL: https://issues.apache.org/jira/browse/SPARK-18631 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0 >Reporter: Mark Hamstra >Assignee: Mark Hamstra > Fix For: 2.2.0 > > > The logic to resize partitions in the ExchangeCoordinator is to not start a > new partition until the targetPostShuffleInputSize is equalled or exceeded. > This can make data skew problems worse since a number of small partitions can > first be combined as long as the combined size remains smaller than the > targetPostShuffleInputSize, and then a large, data-skewed partition can be > further combined, making it even bigger than it already was. > It's fairly simple to change the logic to create a new partition if adding > a new piece would exceed the targetPostShuffleInputSize instead of only > creating a new partition after the targetPostShuffleInputSize has already > been exceeded. This results in a few more partitions being created by the > ExchangeCoordinator, but data skew problems are at least not made worse even > though they are not made any better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
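To make the fixed rule concrete, an illustrative stand-alone sketch (not the actual ExchangeCoordinator code): a new post-shuffle partition is opened before adding a piece that would push the current one past the target size.
{code}
// Illustrative sketch of the fixed rule (not the real ExchangeCoordinator code):
// open a new post-shuffle partition *before* adding a piece that would push the
// current one past the target size.
import scala.collection.mutable.ArrayBuffer

def coalesce(pieceSizes: Seq[Long], target: Long): List[List[Long]] = {
  val groups = ArrayBuffer(ArrayBuffer.empty[Long])
  pieceSizes.foreach { size =>
    if (groups.last.nonEmpty && groups.last.sum + size > target) {
      groups += ArrayBuffer(size)   // adding would exceed the target: start a new partition
    } else {
      groups.last += size
    }
  }
  groups.map(_.toList).toList
}

// With a target of 100, the skewed 90-byte piece is no longer merged into the
// small pieces before it (the old rule would have produced a 120-byte partition).
println(coalesce(Seq(10L, 10L, 10L, 90L, 5L), 100L))
// List(List(10, 10, 10), List(90, 5))
{code}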
[jira] [Updated] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column
[ https://issues.apache.org/jira/browse/SPARK-18468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18468: - Target Version/s: (was: 2.1.0) > Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist > Parquet relation with decimal column > -- > > Key: SPARK-18468 > URL: https://issues.apache.org/jira/browse/SPARK-18468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Yin Huai >Priority: Critical > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/ > https://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.4/71 > Seems we failed to stop the driver > {code} > 2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: > Cannot receive any reply in 120 seconds. This timeout is controlled by > spark.rpc.askTimeout > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > 2016-11-15 18:36:47.76 - stderr> at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > 2016-11-15 18:36:47.76 - stderr> at > scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) > 2016-11-15 18:36:47.76 - stderr> at scala.util.Try$.apply(Try.scala:192) > 2016-11-15 18:36:47.76 - stderr> at > scala.util.Failure.recover(Try.scala:216) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > 2016-11-15 18:36:47.76 - stderr> at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Promise$class.complete(Promise.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78) > 2016-11-15 18:36:47.76 - stderr> at > 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Promise$class.t
[jira] [Assigned] (SPARK-18602) Dependency list still shows that the version of org.codehaus.janino:commons-compiler is 2.7.6
[ https://issues.apache.org/jira/browse/SPARK-18602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-18602: Assignee: Yin Huai > Dependency list still shows that the version of > org.codehaus.janino:commons-compiler is 2.7.6 > - > > Key: SPARK-18602 > URL: https://issues.apache.org/jira/browse/SPARK-18602 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.1.0 > > > org.codehaus.janino:janino:3.0.0 depends on > org.codehaus.janino:commons-compiler:3.0.0. > However, > https://github.com/apache/spark/blob/branch-2.1/dev/deps/spark-deps-hadoop-2.7 > still shows that commons-compiler from janino is 2.7.6. This is probably > because hive module depends on calcite-core, which depends on > commons-compiler 2.7.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18602) Dependency list still shows that the version of org.codehaus.janino:commons-compiler is 2.7.6
[ https://issues.apache.org/jira/browse/SPARK-18602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18602. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 16025 [https://github.com/apache/spark/pull/16025] > Dependency list still shows that the version of > org.codehaus.janino:commons-compiler is 2.7.6 > - > > Key: SPARK-18602 > URL: https://issues.apache.org/jira/browse/SPARK-18602 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.1.0 >Reporter: Yin Huai > Fix For: 2.1.0 > > > org.codehaus.janino:janino:3.0.0 depends on > org.codehaus.janino:commons-compiler:3.0.0. > However, > https://github.com/apache/spark/blob/branch-2.1/dev/deps/spark-deps-hadoop-2.7 > still shows that commons-compiler from janino is 2.7.6. This is probably > because hive module depends on calcite-core, which depends on > commons-compiler 2.7.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18602) Dependency list still shows that the version of org.codehaus.janino:commons-compiler is 2.7.6
Yin Huai created SPARK-18602: Summary: Dependency list still shows that the version of org.codehaus.janino:commons-compiler is 2.7.6 Key: SPARK-18602 URL: https://issues.apache.org/jira/browse/SPARK-18602 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 2.1.0 Reporter: Yin Huai org.codehaus.janino:janino:3.0.0 depends on org.codehaus.janino:commons-compiler:3.0.0. However, https://github.com/apache/spark/blob/branch-2.1/dev/deps/spark-deps-hadoop-2.7 still shows that commons-compiler from janino is 2.7.6. This is probably because hive module depends on calcite-core, which depends on commons-compiler 2.7.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18544) Append with df.saveAsTable writes data to wrong location
[ https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18544: - Priority: Blocker (was: Major) > Append with df.saveAsTable writes data to wrong location > > > Key: SPARK-18544 > URL: https://issues.apache.org/jira/browse/SPARK-18544 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang >Priority: Blocker > > When using saveAsTable in append mode, data will be written to the wrong > location for non-managed Datasource tables. The following example illustrates > this. > It seems that DataFrameWriter somehow passes the wrong table path to > InsertIntoHadoopFsRelation. Also, we should probably remove the repair table call at the > end of saveAsTable in DataFrameWriter. That shouldn't be needed in either the > Hive or Datasource case. > {code} > scala> spark.sqlContext.range(1).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test_10k") > scala> sql("msck repair table test_10k") > scala> sql("select * from test_10k where A = 1").count > res6: Long = 1 > scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as > B").write.partitionBy("A", "B").mode("append").parquet("/tmp/test_10k") > scala> sql("select * from test_10k where A = 1").count > res8: Long = 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
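For completeness, a hypothetical repro sketch that goes through {{saveAsTable}} itself rather than direct parquet writes; the table name and path below are made up, and the expected counts are only what the description above implies.
{code}
// Hypothetical saveAsTable-based repro sketch (table name and path are made up).
spark.range(1).selectExpr("id", "id as A", "id as B")
  .write.partitionBy("A", "B").option("path", "/tmp/test_10k")
  .mode("overwrite").saveAsTable("test_10k")
spark.sql("select * from test_10k where A = 1").count()   // 1

spark.range(10).selectExpr("id", "id as A", "id as B")
  .write.partitionBy("A", "B").mode("append").saveAsTable("test_10k")
// Expected 2; if the appended files land in the wrong location, as described
// above, this stays at 1.
spark.sql("select * from test_10k where A = 1").count()
{code}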
[jira] [Resolved] (SPARK-15513) Bzip2Factory in Hadoop 2.7.1 is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-15513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15513. -- Resolution: Won't Fix > Bzip2Factory in Hadoop 2.7.1 is not thread safe > --- > > Key: SPARK-15513 > URL: https://issues.apache.org/jira/browse/SPARK-15513 > Project: Spark > Issue Type: Bug > Components: Spark Core > Environment: Hadoop 2.7.1 >Reporter: Yin Huai > > This is caused by https://issues.apache.org/jira/browse/HADOOP-12191. When we > are loading the native bzip2 lib by one thread, other threads think that > native bzip2 lib is not available and then throws exceptions. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 > (TID 37, localhost): java.lang.UnsupportedOperationException > at > org.apache.hadoop.io.compress.bzip2.BZip2DummyCompressor.finished(BZip2DummyCompressor.java:48) > at > org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:65) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102) > at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1205) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1278) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Suppressed: java.lang.UnsupportedOperationException > at > org.apache.hadoop.io.compress.bzip2.BZip2DummyCompressor.finished(BZip2DummyCompressor.java:48) > at > org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:89) > at > org.apache.hadoop.io.compress.CompressorStream.close(CompressorStream.java:106) > at java.io.FilterOutputStream.close(FilterOutputStream.java:159) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108) > at > org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1211) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1296) > ... 
8 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) > at > org.apache.spark.schedul
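A minimal repro sketch of the kind of workload that can trigger the race, assuming a Spark shell with {{sc}} in scope, Hadoop 2.7.1 native libraries loaded, and made-up output paths: several threads write bzip2-compressed output concurrently, so the codec factory is initialized from multiple threads at once.
{code}
// Hedged repro sketch: several threads write bzip2-compressed output at the same
// time, so the native codec factory is initialized concurrently (paths are made up).
import org.apache.hadoop.io.compress.BZip2Codec

(1 to 8).par.foreach { i =>
  sc.parallelize(1 to 1000, 4)
    .saveAsTextFile(s"/tmp/bzip2-race-$i", classOf[BZip2Codec])
}
{code}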
[jira] [Commented] (SPARK-15513) Bzip2Factory in Hadoop 2.7.1 is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-15513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684302#comment-15684302 ] Yin Huai commented on SPARK-15513: -- I am closing this jira since the fix has been released with 2.7.2. > Bzip2Factory in Hadoop 2.7.1 is not thread safe > --- > > Key: SPARK-15513 > URL: https://issues.apache.org/jira/browse/SPARK-15513 > Project: Spark > Issue Type: Bug > Components: Spark Core > Environment: Hadoop 2.7.1 >Reporter: Yin Huai > > This is caused by https://issues.apache.org/jira/browse/HADOOP-12191. When we > are loading the native bzip2 lib by one thread, other threads think that > native bzip2 lib is not available and then throws exceptions. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 > (TID 37, localhost): java.lang.UnsupportedOperationException > at > org.apache.hadoop.io.compress.bzip2.BZip2DummyCompressor.finished(BZip2DummyCompressor.java:48) > at > org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:65) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102) > at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1205) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1278) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Suppressed: java.lang.UnsupportedOperationException > at > org.apache.hadoop.io.compress.bzip2.BZip2DummyCompressor.finished(BZip2DummyCompressor.java:48) > at > org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:89) > at > org.apache.hadoop.io.compress.CompressorStream.close(CompressorStream.java:106) > at java.io.FilterOutputStream.close(FilterOutputStream.java:159) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108) > at > org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1211) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1296) > ... 
8 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGSche
[jira] [Comment Edited] (SPARK-15513) Bzip2Factory in Hadoop 2.7.1 is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-15513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684302#comment-15684302 ] Yin Huai edited comment on SPARK-15513 at 11/21/16 6:17 PM: I am closing this jira since the fix has been released with hadoop 2.7.2. was (Author: yhuai): I am closing this jira since the fix has been released with 2.7.2. > Bzip2Factory in Hadoop 2.7.1 is not thread safe > --- > > Key: SPARK-15513 > URL: https://issues.apache.org/jira/browse/SPARK-15513 > Project: Spark > Issue Type: Bug > Components: Spark Core > Environment: Hadoop 2.7.1 >Reporter: Yin Huai > > This is caused by https://issues.apache.org/jira/browse/HADOOP-12191. When we > are loading the native bzip2 lib by one thread, other threads think that > native bzip2 lib is not available and then throws exceptions. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 > (TID 37, localhost): java.lang.UnsupportedOperationException > at > org.apache.hadoop.io.compress.bzip2.BZip2DummyCompressor.finished(BZip2DummyCompressor.java:48) > at > org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:65) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102) > at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1205) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1278) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Suppressed: java.lang.UnsupportedOperationException > at > org.apache.hadoop.io.compress.bzip2.BZip2DummyCompressor.finished(BZip2DummyCompressor.java:48) > at > org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:89) > at > org.apache.hadoop.io.compress.CompressorStream.close(CompressorStream.java:106) > at java.io.FilterOutputStream.close(FilterOutputStream.java:159) > at > org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108) > at > org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1211) > at > 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1296) > ... 8 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:236
[jira] [Resolved] (SPARK-18360) default table path of tables in default database should depend on the location of default database
[ https://issues.apache.org/jira/browse/SPARK-18360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18360. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15812 [https://github.com/apache/spark/pull/15812] > default table path of tables in default database should depend on the > location of default database > -- > > Key: SPARK-18360 > URL: https://issues.apache.org/jira/browse/SPARK-18360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18360) default table path of tables in default database should depend on the location of default database
[ https://issues.apache.org/jira/browse/SPARK-18360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18360: - Labels: release_notes releasenotes (was: ) > default table path of tables in default database should depend on the > location of default database > -- > > Key: SPARK-18360 > URL: https://issues.apache.org/jira/browse/SPARK-18360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column
[ https://issues.apache.org/jira/browse/SPARK-18468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18468: - Description: https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/ https://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.4/71 Seems we failed to stop the driver {code} 2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 2016-11-15 18:36:47.76 - stderr>at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) 2016-11-15 18:36:47.76 - stderr>at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) 2016-11-15 18:36:47.76 - stderr>at scala.util.Try$.apply(Try.scala:192) 2016-11-15 18:36:47.76 - stderr>at scala.util.Failure.recover(Try.scala:216) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 2016-11-15 18:36:47.76 - stderr>at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Promise$class.complete(Promise.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) 2016-11-15 
18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239) 2016-11-15 18:36:47.76 - stderr>at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 2016-11-15 18:36:47.76 - stderr>at java.util.concurrent.FutureTask.run(Futu
[jira] [Updated] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column
[ https://issues.apache.org/jira/browse/SPARK-18468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18468: - Component/s: (was: SQL) Spark Core > Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist > Parquet relation with decimal column > -- > > Key: SPARK-18468 > URL: https://issues.apache.org/jira/browse/SPARK-18468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Yin Huai >Priority: Critical > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/ > Seems we failed to stop the driver > {code} > 2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: > Cannot receive any reply in 120 seconds. This timeout is controlled by > spark.rpc.askTimeout > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > 2016-11-15 18:36:47.76 - stderr> at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > 2016-11-15 18:36:47.76 - stderr> at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > 2016-11-15 18:36:47.76 - stderr> at > scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) > 2016-11-15 18:36:47.76 - stderr> at scala.util.Try$.apply(Try.scala:192) > 2016-11-15 18:36:47.76 - stderr> at > scala.util.Failure.recover(Try.scala:216) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > 2016-11-15 18:36:47.76 - stderr> at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Promise$class.complete(Promise.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > 2016-11-15 18:36:47.76 - stderr> at > scala.concurrent.Promise$class.tryFailure(Promise.scala:112) > 2016-11-15 18:36:47.76 - st
[jira] [Resolved] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18186. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15703 [https://github.com/apache/spark/pull/15703] > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.2.0 > > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
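The change itself lives in {{HiveUDAFFunction}}, but the underlying idea, an aggregate whose intermediate state is an arbitrary object with a map-side (partial) step and a merge step, can be illustrated with the public {{Aggregator}} API; the sketch below is only an illustration of that idea, not the code from the pull request.
{code}
// Illustration of partial aggregation with an arbitrary object as state, using
// the public Aggregator API (a sketch; the actual change uses
// TypedImperativeAggregate inside HiveUDAFFunction).
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class AvgState(var sum: Long, var count: Long)

object LongAvg extends Aggregator[Long, AvgState, Double] {
  def zero: AvgState = AvgState(0L, 0L)
  // partial (map-side) aggregation: fold one input value into the buffer
  def reduce(b: AvgState, a: Long): AvgState = { b.sum += a; b.count += 1; b }
  // combine two partial buffers produced on different partitions
  def merge(b1: AvgState, b2: AvgState): AvgState = { b1.sum += b2.sum; b1.count += b2.count; b1 }
  def finish(r: AvgState): Double = if (r.count == 0) 0.0 else r.sum.toDouble / r.count
  def bufferEncoder: Encoder[AvgState] = Encoders.product[AvgState]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage in a Spark shell: spark.range(100).as[Long].select(LongAvg.toColumn).show()
{code}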
[jira] [Created] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column
Yin Huai created SPARK-18468: Summary: Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column Key: SPARK-18468 URL: https://issues.apache.org/jira/browse/SPARK-18468 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Yin Huai Priority: Critical https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/ Seems we failed to stop the driver {code} 2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 2016-11-15 18:36:47.76 - stderr>at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) 2016-11-15 18:36:47.76 - stderr>at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) 2016-11-15 18:36:47.76 - stderr>at scala.util.Try$.apply(Try.scala:192) 2016-11-15 18:36:47.76 - stderr>at scala.util.Failure.recover(Try.scala:216) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 2016-11-15 18:36:47.76 - stderr>at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Promise$class.complete(Promise.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54) 2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) 2016-11-15 18:36:47.76 - stderr>at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205) 2016-11-15 18:36:47.76 - stderr>at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239) 2016-1
[jira] [Commented] (SPARK-18464) Spark SQL fails to load tables created without providing a schema
[ https://issues.apache.org/jira/browse/SPARK-18464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15669057#comment-15669057 ] Yin Huai commented on SPARK-18464: -- cc [~cloud_fan] > Spark SQL fails to load tables created without providing a schema > - > > Key: SPARK-18464 > URL: https://issues.apache.org/jira/browse/SPARK-18464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yin Huai >Priority: Blocker > > I have an old table that was created without providing a schema. Seems branch > 2.1 fails to load it and says that the schema is corrupt. > With {{spark.sql.debug}} enabled, I get the metadata by using {{describe > formatted}}. > {code} > [col,array,from deserializer] > [,,] > [# Detailed Table Information,,] > [Database:,mydb,] > [Owner:,root,] > [Create Time:,Fri Jun 17 11:55:07 UTC 2016,] > [Last Access Time:,Thu Jan 01 00:00:00 UTC 1970,] > [Location:,mylocation,] > [Table Type:,EXTERNAL,] > [Table Parameters:,,] > [ transient_lastDdlTime,1466164507,] > [ spark.sql.sources.provider,parquet,] > [,,] > [# Storage Information,,] > [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,] > [InputFormat:,org.apache.hadoop.mapred.SequenceFileInputFormat,] > [OutputFormat:,org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,] > [Compressed:,No,] > [Storage Desc Parameters:,,] > [ path,/myPatch,] > [ serialization.format,1,] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
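For context, a hypothetical sketch of how such a schema-less datasource table was typically created on older Spark versions (database, table, and path names are made up): only the provider and the path are stored, and the schema is inferred from the Parquet files at read time.
{code}
// Hypothetical example of creating a datasource table without a schema on an
// old Spark version (names and path are made up); the Parquet schema is
// inferred from the files rather than stored in the metastore.
sqlContext.sql("""
  CREATE TABLE mydb.old_table
  USING parquet
  OPTIONS (path '/some/existing/parquet/dir')
""")
{code}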