[jira] [Commented] (SPARK-28898) SQL Configuration should be mentioned under Spark SQL in User Guide
[ https://issues.apache.org/jira/browse/SPARK-28898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919210#comment-16919210 ] Hyukjin Kwon commented on SPARK-28898: -- BTW, I wonder if there's a good way to automatically generate the doc from {{SQLConf.scala}} > SQL Configuration should be mentioned under Spark SQL in User Guide > --- > > Key: SPARK-28898 > URL: https://issues.apache.org/jira/browse/SPARK-28898 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Currently, an end user can discover the entire list of the spark.sql.* > configurations only by running {{SET -v;}}. > Users who are not familiar with the Spark code base have no other way to > browse the full list of SQL configurations. > I feel the entire list of SQL configurations should be documented in the 3.0 > user guide, like > [Spark-Streaming|https://spark.apache.org/docs/latest/configuration.html#spark-streaming] -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28898) SQL Configuration should be mentioned under Spark SQL in User Guide
[ https://issues.apache.org/jira/browse/SPARK-28898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919209#comment-16919209 ] Hyukjin Kwon commented on SPARK-28898: -- I don't think we need to list up the entire list, but it might be worth noting some public and important configurations. > SQL Configuration should be mentioned under Spark SQL in User Guide > --- > > Key: SPARK-28898 > URL: https://issues.apache.org/jira/browse/SPARK-28898 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Currently, an end user can discover the entire list of the spark.sql.* > configurations only by running {{SET -v;}}. > Users who are not familiar with the Spark code base have no other way to > browse the full list of SQL configurations. > I feel the entire list of SQL configurations should be documented in the 3.0 > user guide, like > [Spark-Streaming|https://spark.apache.org/docs/latest/configuration.html#spark-streaming]
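[Editor's note] The comment above wonders whether the SQL configuration page could be generated automatically from {{SQLConf.scala}}. A minimal sketch of what such a generator might do, assuming the entries come from something like {{SET -v}} output; the helper name and the sample entry are illustrative, not Spark's actual doc tooling:

```python
def render_config_table(entries):
    """Render (name, default, meaning) tuples as a Markdown table,
    roughly the shape used on the Spark configuration page."""
    lines = [
        "| Property Name | Default | Meaning |",
        "| --- | --- | --- |",
    ]
    for name, default, meaning in entries:
        lines.append(f"| `{name}` | {default} | {meaning} |")
    return "\n".join(lines)

# Example entry, as `SET -v` might report it (value and wording illustrative):
entries = [
    ("spark.sql.shuffle.partitions", "200",
     "Number of partitions used when shuffling data for joins or aggregations."),
]
print(render_config_table(entries))
```

In a real generator the entries would be collected from a SparkSession (e.g. the result of {{spark.sql("SET -v")}}) rather than hard-coded.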
[jira] [Comment Edited] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests
[ https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919203#comment-16919203 ] Hyukjin Kwon edited comment on SPARK-28900 at 8/30/19 5:28 AM: --- Just quick FYI, {{spark-master-test-maven-hadoop-3.2}} builder would need to install pyarrow and pandas identically like PR builder ... I believe this could be done separately in a separate ticket if it matters cc [~yumwang] since you reached out to me about this. Once they are properly installed, those skipped tests will be executed properly (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/testReport/org.apache.spark.sql/SQLQueryTestSuite/) was (Author: hyukjin.kwon): Just quick FYI, {{spark-master-test-maven-hadoop-3.2]} builder would need to install pyarrow and pandas identically like PR builder ... I believe this could be done separately in a separate ticket if it matters cc [~yumwang] since you reached out to me about this. Once they are properly installed, those skipped tests will be executed properly (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/testReport/org.apache.spark.sql/SQLQueryTestSuite/) > Test Pyspark, SparkR on JDK 11 with run-tests > - > > Key: SPARK-28900 > URL: https://issues.apache.org/jira/browse/SPARK-28900 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean Owen >Priority: Major > > Right now, we are testing JDK 11 with a Maven-based build, as in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/ > It looks like _all_ of the Maven-based jobs 'manually' build and invoke > tests, and only run tests via Maven -- that is, they do not run Pyspark or > SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} > script that is meant to be for this purpose. 
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT > builds look like: > {code} > #!/bin/bash > set -e > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention > export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER" > mkdir -p "$HOME" > export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2" > export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > git clean -fdx > ./dev/run-tests > {code} > Maven builds looks like: > {code} > #!/bin/bash > set -x > set -e > rm -rf ./work > git clean -fdx > # Generate random point for Zinc > export ZINC_PORT > ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") > # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention: > export > SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2" > mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > MVN="build/mvn -DzincPort=$ZINC_PORT" > set +e > if [[ $HADOOP_PROFILE == hadoop-1 ]]; then > # Note that there is no -Pyarn flag here for Hadoop 1: > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? 
> $MVN \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? > else > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? > fi > if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then > if [[ $retcode1 -ne 0 ]]; then > echo "Packaging Spark with Maven failed" > fi > if [[ $retcode2 -ne 0 ]]; then > echo "Testing Spark with Maven failed" >
[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests
[ https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919203#comment-16919203 ] Hyukjin Kwon commented on SPARK-28900: -- Just quick FYI, the {{spark-master-test-maven-hadoop-3.2}} builder would need to install pyarrow and pandas in the same way as the PR builder ... I believe this could be done separately in a separate ticket if it matters cc [~yumwang] since you reached out to me about this. Once they are properly installed, those skipped tests will be executed properly (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/testReport/org.apache.spark.sql/SQLQueryTestSuite/) > Test Pyspark, SparkR on JDK 11 with run-tests > - > > Key: SPARK-28900 > URL: https://issues.apache.org/jira/browse/SPARK-28900 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean Owen >Priority: Major > > Right now, we are testing JDK 11 with a Maven-based build, as in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/ > It looks like _all_ of the Maven-based jobs 'manually' build and invoke > tests, and only run tests via Maven -- that is, they do not run Pyspark or > SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} > script that is meant to be for this purpose. > In fact, there seem to be a couple flavors of copy-pasted build configs. SBT > builds look like: > {code} > #!/bin/bash > set -e > # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention > export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER" > mkdir -p "$HOME" > export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2" > export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. 
> export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > git clean -fdx > ./dev/run-tests > {code} > Maven builds looks like: > {code} > #!/bin/bash > set -x > set -e > rm -rf ./work > git clean -fdx > # Generate random point for Zinc > export ZINC_PORT > ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") > # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention: > export > SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2" > mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export > PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH" > MVN="build/mvn -DzincPort=$ZINC_PORT" > set +e > if [[ $HADOOP_PROFILE == hadoop-1 ]]; then > # Note that there is no -Pyarn flag here for Hadoop 1: > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Dhadoop.version="$HADOOP_VERSION" \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? > else > $MVN \ > -DskipTests \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > clean package > retcode1=$? > $MVN \ > -P"$HADOOP_PROFILE" \ > -Pyarn \ > -Phive \ > -Phive-thriftserver \ > -Pkinesis-asl \ > -Pmesos \ > --fail-at-end \ > test > retcode2=$? 
> fi > if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then > if [[ $retcode1 -ne 0 ]]; then > echo "Packaging Spark with Maven failed" > fi > if [[ $retcode2 -ne 0 ]]; then > echo "Testing Spark with Maven failed" > fi > exit 1 > fi > {code} > The PR builder (one of them at least) looks like: > {code} > #!/bin/bash > set -e # fail on any non-zero exit code > set -x > export AMPLAB_JENKINS=1 > export PATH="$PATH:/home/anaconda/envs/py3k/bin" > # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental > compiler seems to > # ignore our JAVA_HOME and use the system javac instead. > export PATH="$JAVA_HOME/bin:$PATH" > # Add a pre-downloaded version of Maven to the path so that we avoid the > flaky download step. > export >
[jira] [Issue Comment Deleted] (SPARK-28809) Document SHOW TABLE in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-28809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivu Sondur updated SPARK-28809: - Comment: was deleted (was: [~dkbiswal] I am not able to find a "SHOW TABLE" command in Spark; the "SHOW TABLES" command is present. Please recheck whether this Jira is valid.) > Document SHOW TABLE in SQL Reference > > > Key: SPARK-28809 > URL: https://issues.apache.org/jira/browse/SPARK-28809 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Priority: Major >
[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919190#comment-16919190 ] Hyukjin Kwon commented on SPARK-28906: -- cc [~kiszk] FYI > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` has shown incorrect information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. > {code}
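[Editor's note] For context on the blank {{Branch}}/{{Revision}}/{{Url}} fields above: {{spark-submit --version}} prints build metadata read from a properties file generated at build time, and keys missing from that file show up as empty lines. A hedged Python sketch of a more defensive parser; the file contents, key names, and the "<unknown>" fallback are assumptions for illustration, not Spark's actual implementation:

```python
SAMPLE = """\
version=2.3.3
user=user
date=2019-02-04T13:00:46Z
"""

def parse_version_info(text):
    """Parse simple key=value lines; keys absent from the file fall back to
    '<unknown>' so fields like Branch/Revision/Url never print as blanks."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            info[key.strip()] = value.strip()
    for key in ("version", "branch", "revision", "url", "user", "date"):
        info.setdefault(key, "<unknown>")
    return info

info = parse_version_info(SAMPLE)
print(f"Branch {info['branch']}")    # prints a fallback instead of an empty field
print(f"Revision {info['revision']}")
```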
[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919188#comment-16919188 ] Hyukjin Kwon commented on SPARK-28916: -- Reproducer: {code} val df = spark.range(100).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*) df.createOrReplaceTempView("spark64kb") val data = spark.sql("select * from spark64kb limit 10") data.describe() {code} > Generated SpecificSafeProjection.apply method grows beyond 64 KB when use > SparkSQL > --- > > Key: SPARK-28916 > URL: https://issues.apache.org/jira/browse/SPARK-28916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 >Reporter: MOBIN >Priority: Major > > Can be reproduced by the following steps: > 1. Create a table with 5000 fields > 2. val data=spark.sql("select * from spark64kb limit 10"); > 3. data.describe() > Then,The following error occurred > {code:java} > WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, > executor 1): org.codehaus.janino.InternalCompilerException: failed to > compile: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method > "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection" > grows beyond 64 KB > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44) > at > org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199) > at > org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >
[jira] [Updated] (SPARK-28910) Prevent schema verification when connecting to in memory derby
[ https://issues.apache.org/jira/browse/SPARK-28910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28910: - Description: When {{hive.metastore.schema.verification=true}}, {{HiveUtils.newClientForExecution}} fails with {code} 19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should not accessed in runtime. org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290) at org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58) ... Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient {code} This prevents Thriftserver from starting was: When hive.metastore.schema.verification=true, HiveUtils.newClientForExecution fails with {{19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should not accessed in runtime. 
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290) at org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58) ... Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient}} This prevents Thriftserver from starting > Prevent schema verification when connecting to in memory derby > -- > > Key: SPARK-28910 > URL: https://issues.apache.org/jira/browse/SPARK-28910 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Juliusz Sompolski >Priority: Major > > When {{hive.metastore.schema.verification=true}}, > {{HiveUtils.newClientForExecution}} fails with > {code} > 19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should > not accessed in runtime. 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: > Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) > at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) > at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > at > org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58) > ... > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > {code} > This prevents Thriftserver from starting -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28910) Prevent schema verification when connecting to in memory derby
[ https://issues.apache.org/jira/browse/SPARK-28910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28910: - Component/s: (was: Spark Core) SQL > Prevent schema verification when connecting to in memory derby > -- > > Key: SPARK-28910 > URL: https://issues.apache.org/jira/browse/SPARK-28910 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Juliusz Sompolski >Priority: Major > > When {{hive.metastore.schema.verification=true}}, > {{HiveUtils.newClientForExecution}} fails with > {code} > 19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should > not accessed in runtime. > org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: > Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) > at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) > at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > at > org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58) > ... > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > {code} > This prevents Thriftserver from starting -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28861) Jetty property handling: java.lang.NumberFormatException: For input string: "unknown".
[ https://issues.apache.org/jira/browse/SPARK-28861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919183#comment-16919183 ] Hyukjin Kwon commented on SPARK-28861: -- ping [~Shrikhande] > Jetty property handling: java.lang.NumberFormatException: For input string: > "unknown". > -- > > Key: SPARK-28861 > URL: https://issues.apache.org/jira/browse/SPARK-28861 > Project: Spark > Issue Type: Wish > Components: Spark Submit >Affects Versions: 2.4.3 >Reporter: Ketan >Priority: Minor > > While processing data from certain files, a {{NumberFormatException}} was seen > in the logs. The processing itself was fine, but the following stack trace was > observed: > {code} > {"time":"2019-08-16 > 08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""} > java.lang.NumberFormatException: For input string: "unknown". > {code} > On investigation, it was found that the {{Jetty}} class contains the following: > {code} > BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", > "unknown")); > {code} > which indicates that the config should have the 'timestamp' property. If the > property is not there, the default value 'unknown' is used, and this > value causes the stack trace to show up in our application's logs. It has > no detrimental effect on the application as such, but could be addressed. >
[jira] [Commented] (SPARK-28912) MatchError exception in CheckpointWriteHandler
[ https://issues.apache.org/jira/browse/SPARK-28912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919182#comment-16919182 ] Hyukjin Kwon commented on SPARK-28912: -- If you're not going to open a PR, it would be much easier for other people to write a test and investigate further if there were a self-contained reproducer. > MatchError exception in CheckpointWriteHandler > -- > > Key: SPARK-28912 > URL: https://issues.apache.org/jira/browse/SPARK-28912 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.2 >Reporter: Aleksandr Kashkirov >Priority: Minor > > Setting checkpoint directory name to "checkpoint-" plus some digits (e.g. > "checkpoint-01") results in the following error: > {code:java} > Exception in thread "pool-32-thread-1" scala.MatchError: > 0523a434-0daa-4ea6-a050-c4eb3c557d8c (of class java.lang.String) > at > org.apache.spark.streaming.Checkpoint$.org$apache$spark$streaming$Checkpoint$$sortFunc$1(Checkpoint.scala:121) > > at > org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132) > > at > org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132) > > at scala.math.Ordering$$anon$9.compare(Ordering.scala:200) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.util.TimSort.sort(TimSort.java:234) > at java.util.Arrays.sort(Arrays.java:1438) > at scala.collection.SeqLike$class.sorted(SeqLike.scala:648) > at scala.collection.mutable.ArrayOps$ofRef.sorted(ArrayOps.scala:186) > at scala.collection.SeqLike$class.sortWith(SeqLike.scala:601) > at scala.collection.mutable.ArrayOps$ofRef.sortWith(ArrayOps.scala:186) > at > org.apache.spark.streaming.Checkpoint$.getCheckpointFiles(Checkpoint.scala:132) > > at > org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:262) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748){code}
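[Editor's note] A simplified Python model of the failure mode above (this is not Spark's actual Scala code): the sort key assumes every name listed in the checkpoint directory matches a {{checkpoint-<digits>}} pattern, so a stray file with a UUID-style name raises, much like the reported {{scala.MatchError}}. The pattern, the ".bk" backup suffix, and the defensive filter are assumptions for illustration:

```python
import re

# Assumed checkpoint file naming: "checkpoint-<millis>" plus optional ".bk" backup.
PATTERN = re.compile(r"checkpoint-(\d+)(\.bk)?$")

def strict_key(name):
    """Sort key that assumes every name matches; a non-matching name raises,
    analogous to the scala.MatchError in sortFunc."""
    m = PATTERN.search(name)
    if m is None:
        raise ValueError(f"unexpected checkpoint file name: {name!r}")
    return (int(m.group(1)), m.group(2) is not None)

def sorted_checkpoints(names):
    """Defensive variant: drop names that do not look like checkpoint files
    before sorting, so stray temp files cannot crash the writer."""
    return sorted((n for n in names if PATTERN.search(n)), key=strict_key)

names = ["checkpoint-1000", "0523a434-0daa-4ea6-a050-c4eb3c557d8c", "checkpoint-500"]
print(sorted_checkpoints(names))  # the UUID-named file is ignored
```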
[jira] [Commented] (SPARK-28924) spark.jobGroup.id and spark.job.description is missing in Document
[ https://issues.apache.org/jira/browse/SPARK-28924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919181#comment-16919181 ] Sharanabasappa G Keriwaddi commented on SPARK-28924: [~abhishek.akg], Thanks. I will check this and raise a PR if required. > spark.jobGroup.id and spark.job.description is missing in Document > -- > > Key: SPARK-28924 > URL: https://issues.apache.org/jira/browse/SPARK-28924 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > The *spark.jobGroup.id* and *spark.job.description* properties need to be > documented in the product doc.
[jira] [Commented] (SPARK-28899) merge the testing in-memory v2 catalogs from catalyst and core
[ https://issues.apache.org/jira/browse/SPARK-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919180#comment-16919180 ] Hyukjin Kwon commented on SPARK-28899: -- Fixed in https://github.com/apache/spark/pull/25610 > merge the testing in-memory v2 catalogs from catalyst and core > -- > > Key: SPARK-28899 > URL: https://issues.apache.org/jira/browse/SPARK-28899 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-28924) spark.jobGroup.id and spark.job.description is missing in Document
ABHISHEK KUMAR GUPTA created SPARK-28924: Summary: spark.jobGroup.id and spark.job.description is missing in Document Key: SPARK-28924 URL: https://issues.apache.org/jira/browse/SPARK-28924 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.4.3 Reporter: ABHISHEK KUMAR GUPTA The *spark.jobGroup.id* and *spark.job.description* properties need to be documented in the product doc.
[jira] [Commented] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019
[ https://issues.apache.org/jira/browse/SPARK-28918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919173#comment-16919173 ] Hyukjin Kwon commented on SPARK-28918: -- BTW {{from_utc_timestamp}} is deprecated as of Spark 3.0. > from_utc_timestamp function is mistakenly considering DST for Brazil in 2019 > > > Key: SPARK-28918 > URL: https://issues.apache.org/jira/browse/SPARK-28918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: I'm using Spark through Databricks >Reporter: Luiz Hissashi da Rocha >Priority: Minor > > I realized that *from_utc_timestamp* function is assuming that Brazil will > have DST in 2019 but it will not, unlike previous years. Because of that, > when I run the function bellow, instead of having "2019-11-14" (São Paulo is > UTC-3h), I still get "2019-11-15T00:18:01" wrongly (as if it was UTC-2h due > to DST). > {code:java} > // from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019
[ https://issues.apache.org/jira/browse/SPARK-28918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28918. -- Resolution: Not A Problem > from_utc_timestamp function is mistakenly considering DST for Brazil in 2019 > > > Key: SPARK-28918 > URL: https://issues.apache.org/jira/browse/SPARK-28918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: I'm using Spark through Databricks >Reporter: Luiz Hissashi da Rocha >Priority: Minor > > I realized that *from_utc_timestamp* function is assuming that Brazil will > have DST in 2019 but it will not, unlike previous years. Because of that, > when I run the function bellow, instead of having "2019-11-14" (São Paulo is > UTC-3h), I still get "2019-11-15T00:18:01" wrongly (as if it was UTC-2h due > to DST). > {code:java} > // from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28916: - Target Version/s: (was: 2.3.1, 2.4.3) > Generated SpecificSafeProjection.apply method grows beyond 64 KB when use > SparkSQL > --- > > Key: SPARK-28916 > URL: https://issues.apache.org/jira/browse/SPARK-28916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 >Reporter: MOBIN >Priority: Major > > Can be reproduced by the following steps: > 1. Create a table with 5000 fields > 2. val data=spark.sql("select * from spark64kb limit 10"); > 3. data.describe() > Then, the following error occurred > {code:java} > WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, > executor 1): org.codehaus.janino.InternalCompilerException: failed to > compile: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method > "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection" > grows beyond 64 KB > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at 
org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44) > at > org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199) > at > org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at 
org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method > "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class >
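For context on the failure above: the JVM rejects any single method whose bytecode exceeds 64 KB, and Spark's generated projection code grows with the number of columns, so a 5000-field schema can push one generated `apply` method past the limit. The sketch below is a rough, purely illustrative model of that growth; the per-field template is hypothetical, not Spark's actual codegen output.

```python
# Illustrative only: approximate how generated projection code grows with
# column count. The JVM refuses to load any single method whose bytecode
# exceeds 64 KB (65536 bytes); generated-source size is used as a rough
# proxy for bytecode size here.
JVM_METHOD_LIMIT = 64 * 1024

def generated_apply_size(num_fields: int) -> int:
    # Hypothetical per-field template, loosely modeled on what a
    # SpecificSafeProjection-style apply() might emit for each column.
    template = (
        "boolean isNull_{i} = row.isNullAt({i});\n"
        "Object value_{i} = isNull_{i} ? null : row.get({i});\n"
        "result.update({i}, value_{i});\n"
    )
    return sum(len(template.format(i=i)) for i in range(num_fields))

few = generated_apply_size(50)     # small schema: well under the limit
many = generated_apply_size(5000)  # the 5000-column repro from this issue

print(few < JVM_METHOD_LIMIT, many > JVM_METHOD_LIMIT)
```

In affected Spark versions the practical mitigations are reducing the column count per projection or relying on Spark's fallback to interpreted execution when codegen compilation fails.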
[jira] [Created] (SPARK-28923) Deduplicate the codes 'multipartIdentifier' and 'identifierSeq'
Xianyin Xin created SPARK-28923: --- Summary: Deduplicate the codes 'multipartIdentifier' and 'identifierSeq' Key: SPARK-28923 URL: https://issues.apache.org/jira/browse/SPARK-28923 Project: Spark Issue Type: Request Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin In {{sqlbase.g4}}, {{multipartIdentifier}} and {{identifierSeq}} have the same functionality. We'd better deduplicate them. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28920) Set up java version for github workflow
[ https://issues.apache.org/jira/browse/SPARK-28920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28920. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25625 [https://github.com/apache/spark/pull/25625] > Set up java version for github workflow > --- > > Key: SPARK-28920 > URL: https://issues.apache.org/jira/browse/SPARK-28920 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > We added java matrix to github workflow. As we want to build with JDK8/11, we > should set up java version for mvn. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28920) Set up java version for github workflow
[ https://issues.apache.org/jira/browse/SPARK-28920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28920: - Assignee: Liang-Chi Hsieh > Set up java version for github workflow > --- > > Key: SPARK-28920 > URL: https://issues.apache.org/jira/browse/SPARK-28920 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > We added java matrix to github workflow. As we want to build with JDK8/11, we > should set up java version for mvn. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28919) Add more profiles for JDK8/11 build test
[ https://issues.apache.org/jira/browse/SPARK-28919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28919. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25624 [https://github.com/apache/spark/pull/25624] > Add more profiles for JDK8/11 build test > > > Key: SPARK-28919 > URL: https://issues.apache.org/jira/browse/SPARK-28919 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > From Jenkins, we build with JDK8 and test with JDK8/11. > We are testing JDK11 build via GitHub workflow. This issue aims to add more > profiles. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28919) Add more profiles for JDK8/11 build test
[ https://issues.apache.org/jira/browse/SPARK-28919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28919: - Assignee: Dongjoon Hyun > Add more profiles for JDK8/11 build test > > > Key: SPARK-28919 > URL: https://issues.apache.org/jira/browse/SPARK-28919 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > From Jenkins, we build with JDK8 and test with JDK8/11. > We are testing JDK11 build via GitHub workflow. This issue aims to add more > profiles. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28922) Safe Kafka parameter redaction
[ https://issues.apache.org/jira/browse/SPARK-28922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28922: - Assignee: Gabor Somogyi > Safe Kafka parameter redaction > -- > > Key: SPARK-28922 > URL: https://issues.apache.org/jira/browse/SPARK-28922 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > > At the moment Kafka parameter reduction is expecting SparkEnv. This must > exist in normal queries but several unit tests are not providing it to make > things simple. As an end-result such tests are throwing similar exception: -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28922) Safe Kafka parameter redaction
[ https://issues.apache.org/jira/browse/SPARK-28922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28922. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25621 [https://github.com/apache/spark/pull/25621] > Safe Kafka parameter redaction > -- > > Key: SPARK-28922 > URL: https://issues.apache.org/jira/browse/SPARK-28922 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 3.0.0 > > > At the moment Kafka parameter reduction is expecting SparkEnv. This must > exist in normal queries but several unit tests are not providing it to make > things simple. As an end-result such tests are throwing similar exception: -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28922) Safe Kafka parameter redaction
[ https://issues.apache.org/jira/browse/SPARK-28922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28922: -- Reporter: Gabor Somogyi (was: Dongjoon Hyun) > Safe Kafka parameter redaction > -- > > Key: SPARK-28922 > URL: https://issues.apache.org/jira/browse/SPARK-28922 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > At the moment Kafka parameter reduction is expecting SparkEnv. This must > exist in normal queries but several unit tests are not providing it to make > things simple. As an end-result such tests are throwing similar exception: -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28922) Safe Kafka parameter redaction
Dongjoon Hyun created SPARK-28922: - Summary: Safe Kafka parameter redaction Key: SPARK-28922 URL: https://issues.apache.org/jira/browse/SPARK-28922 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Dongjoon Hyun At the moment Kafka parameter redaction expects SparkEnv. This must exist in normal queries, but several unit tests do not provide it, to keep things simple. As an end result, such tests throw an exception similar to the following: -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
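The redaction logic described above can be made safe against a missing environment by falling back to a default pattern instead of dereferencing it unconditionally. A minimal Python sketch of that shape follows; all names here are hypothetical (Spark's actual implementation is in Scala and reads its regex from the `spark.redaction.regex` setting via SparkEnv).

```python
import re
from typing import Mapping, Optional

# Fallback pattern used when no environment-provided regex is available
# (analogous to the SparkEnv-missing case in unit tests).
DEFAULT_SECRET_PATTERN = re.compile(r"(?i)secret|password|token")

def redact_params(params: Mapping[str, str],
                  pattern: Optional[re.Pattern] = None) -> dict:
    """Mask values whose keys look sensitive.

    A None pattern degrades gracefully to a built-in default instead of
    raising, which is the 'safe' behavior this issue is about.
    """
    pat = pattern if pattern is not None else DEFAULT_SECRET_PATTERN
    return {k: ("*********" if pat.search(k) else v)
            for k, v in params.items()}

kafka_params = {
    "bootstrap.servers": "broker:9092",
    "ssl.truststore.password": "changeit",
}
print(redact_params(kafka_params))
```

Tests can then call the redaction helper directly, with or without an environment-supplied pattern, and never hit a null-environment failure.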
[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Schweigert updated SPARK-28921: Description: Spark jobs are failing on latest versions of Kubernetes when jobs attempt to provision executor pods (jobs like Spark-Pi that do not launch executors run without a problem): Here's an example error message: {code:java} 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes. 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: HTTP 403, Status: 403 - java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Looks like the issue is caused by the internal master Kubernetes url not having the port specified: [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] Using the master with the port (443) seems to fix the problem. 
was: Spark jobs are failing on latest versions of Kubernetes when jobs attempt to provision executor pods (jobs like Spark-Pi that do not launch executors run without a problem): Here's an example error message: {code:java} 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: HTTP 403, Status: 403 - java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Looks like the issue is caused by the internal master Kubernetes url not having the port specified: [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] Using the master with the port (443) seems to fix the problem. > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) > - > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Paul Schweigert >Priority: Minor > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by the internal master Kubernetes url not > having the port specified: > [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] > > Using the master with the port (443) seems to fix the problem. > -- This message
[jira] [Created] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10)
Paul Schweigert created SPARK-28921: --- Summary: Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10) Key: SPARK-28921 URL: https://issues.apache.org/jira/browse/SPARK-28921 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.3 Reporter: Paul Schweigert Spark jobs are failing on latest versions of Kubernetes when jobs attempt to provision executor pods (jobs like Spark-Pi that do not launch executors run without a problem): Here's an example error message: {code:java} 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: HTTP 403, Status: 403 - java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Looks like the issue is caused by the internal master Kubernetes url not having the port specified: [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7] Using the master with the port (443) seems to fix the problem. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
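The fix suggested in the report above, giving the internal master URL an explicit port, can be sketched with a small normalization helper. The helper name is hypothetical; the actual change would touch the constant in `Constants.scala` linked above.

```python
from urllib.parse import urlsplit, urlunsplit

def with_default_port(master_url: str, default_port: int = 443) -> str:
    """Return the URL with an explicit port, adding the default if absent.

    Illustrates the workaround discussed above: recent Kubernetes versions
    reject the watch request when the master URL lacks a port, so
    'https://kubernetes.default.svc' becomes
    'https://kubernetes.default.svc:443'.
    """
    parts = urlsplit(master_url)
    if parts.port is not None:
        return master_url  # already explicit; leave untouched
    netloc = f"{parts.hostname}:{default_port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

print(with_default_port("https://kubernetes.default.svc"))
```

Note the helper is idempotent: URLs that already carry a port (e.g. a non-standard API server on `:6443`) pass through unchanged.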
[jira] [Resolved] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption
[ https://issues.apache.org/jira/browse/SPARK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28843. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25545 [https://github.com/apache/spark/pull/25545] > Set OMP_NUM_THREADS to executor cores reduce Python memory consumption > -- > > Key: SPARK-28843 > URL: https://issues.apache.org/jira/browse/SPARK-28843 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > While testing hardware with more cores, we found that the amount of memory > required by PySpark applications increased and tracked the problem to > importing numpy. The numpy issue is > [https://github.com/numpy/numpy/issues/10455] > NumPy uses OpenMP that starts a thread pool with the number of cores on the > machine (and does not respect cgroups). When we set this lower we see a > significant reduction in memory consumption. > This parallelism setting should be set to the number of cores allocated to > the executor, not the number of cores available. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption
[ https://issues.apache.org/jira/browse/SPARK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28843: Assignee: Ryan Blue > Set OMP_NUM_THREADS to executor cores reduce Python memory consumption > -- > > Key: SPARK-28843 > URL: https://issues.apache.org/jira/browse/SPARK-28843 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Labels: release-notes > > While testing hardware with more cores, we found that the amount of memory > required by PySpark applications increased and tracked the problem to > importing numpy. The numpy issue is > [https://github.com/numpy/numpy/issues/10455] > NumPy uses OpenMP that starts a thread pool with the number of cores on the > machine (and does not respect cgroups). When we set this lower we see a > significant reduction in memory consumption. > This parallelism setting should be set to the number of cores allocated to > the executor, not the number of cores available. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
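The reason the fix above works: OpenMP sizes its worker-thread pool when the native library (numpy's BLAS backend) is first loaded, so the environment variable must be in place before the import happens. A minimal standalone illustration, where the value `4` stands in for the executor's allocated core count:

```python
import os

# OpenMP reads OMP_NUM_THREADS at library-load time, so this assignment
# must happen BEFORE numpy is imported. "4" stands in for the executor's
# allocated core count, not the machine's total core count.
os.environ["OMP_NUM_THREADS"] = "4"

try:
    import numpy as np  # BLAS/OpenMP thread pool is sized here
except ImportError:
    np = None  # numpy not installed; the env var is still set correctly

print(os.environ["OMP_NUM_THREADS"])
```

Setting the variable after `import numpy` has no effect on an already-initialized thread pool, which is why the Spark change applies it in the Python worker before user code runs.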
[jira] [Created] (SPARK-28920) Set up java version for github workflow
Liang-Chi Hsieh created SPARK-28920: --- Summary: Set up java version for github workflow Key: SPARK-28920 URL: https://issues.apache.org/jira/browse/SPARK-28920 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We added java matrix to github workflow. As we want to build with JDK8/11, we should set up java version for mvn. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019
[ https://issues.apache.org/jira/browse/SPARK-28918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919045#comment-16919045 ] Shivu Sondur commented on SPARK-28918: -- [~hissashirocha] Brazil abolished DST; its last DST period ended on Sunday, 17 February 2019. You need to update your JVM's timezone data with the TZUpdater tool. Refer to the following link: https://www.oracle.com/technetwork/java/javase/timezones-137583.html#tzu > from_utc_timestamp function is mistakenly considering DST for Brazil in 2019 > > > Key: SPARK-28918 > URL: https://issues.apache.org/jira/browse/SPARK-28918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: I'm using Spark through Databricks >Reporter: Luiz Hissashi da Rocha >Priority: Minor > > I realized that *from_utc_timestamp* function is assuming that Brazil will > have DST in 2019 but it will not, unlike previous years. Because of that, > when I run the function bellow, instead of having "2019-11-14" (São Paulo is > UTC-3h), I still get "2019-11-15T00:18:01" wrongly (as if it was UTC-2h due > to DST). > {code:java} > // from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo') > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28919) Add more profiles for JDK8/11 build test
[ https://issues.apache.org/jira/browse/SPARK-28919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28919: -- Description: From Jenkins, we build with JDK8 and test with JDK8/11. We are testing JDK11 build via GitHub workflow. This issue aims to add more profiles. was: From Jenkins, we build with JDK8 and test with JDK8/11. This issue aims to add JDK11 build test to GitHub workflow. > Add more profiles for JDK8/11 build test > > > Key: SPARK-28919 > URL: https://issues.apache.org/jira/browse/SPARK-28919 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > From Jenkins, we build with JDK8 and test with JDK8/11. > We are testing JDK11 build via GitHub workflow. This issue aims to add more > profiles. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28919) Add more profiles for JDK8/11 build test
Dongjoon Hyun created SPARK-28919: - Summary: Add more profiles for JDK8/11 build test Key: SPARK-28919 URL: https://issues.apache.org/jira/browse/SPARK-28919 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Dongjoon Hyun From Jenkins, we build with JDK8 and test with JDK8/11. This issue aims to add JDK11 build test to GitHub workflow. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019
Luiz Hissashi da Rocha created SPARK-28918: -- Summary: from_utc_timestamp function is mistakenly considering DST for Brazil in 2019 Key: SPARK-28918 URL: https://issues.apache.org/jira/browse/SPARK-28918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Environment: I'm using Spark through Databricks Reporter: Luiz Hissashi da Rocha I realized that *from_utc_timestamp* function is assuming that Brazil will have DST in 2019 but it will not, unlike previous years. Because of that, when I run the function below, instead of having "2019-11-14" (São Paulo is UTC-3h), I still get "2019-11-15T00:18:01" wrongly (as if it was UTC-2h due to DST). {code:java} // from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo') {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
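The underlying issue is stale timezone data rather than the function itself: with up-to-date tzdata (2019b or later, which drops Brazil's previously planned 2019 DST), the conversion gives the expected result. This can be checked outside Spark with Python's `zoneinfo`, assuming Python 3.9+ and current system tzdata:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# The timestamp from the report, interpreted as UTC.
utc_ts = datetime(2019, 11, 15, 2, 18, 1, tzinfo=timezone.utc)
local = utc_ts.astimezone(ZoneInfo("America/Sao_Paulo"))

# With current tzdata, Brazil observes no DST in November 2019, so the
# offset is a plain -03:00 and the local date is still the 14th.
assert local.utcoffset() == timedelta(hours=-3)
print(local.isoformat())  # 2019-11-14T23:18:01-03:00
```

A JVM with tzdata older than 2019b reproduces the reported bug (a -02:00 offset), which is why the recommended fix is updating the timezone data rather than patching Spark.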
[jira] [Resolved] (SPARK-28899) merge the testing in-memory v2 catalogs from catalyst and core
[ https://issues.apache.org/jira/browse/SPARK-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-28899. --- Fix Version/s: 3.0.0 Resolution: Fixed > merge the testing in-memory v2 catalogs from catalyst and core > -- > > Key: SPARK-28899 > URL: https://issues.apache.org/jira/browse/SPARK-28899 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28760) Add end-to-end Kafka delegation token test
[ https://issues.apache.org/jira/browse/SPARK-28760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-28760. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25477 [https://github.com/apache/spark/pull/25477] > Add end-to-end Kafka delegation token test > -- > > Key: SPARK-28760 > URL: https://issues.apache.org/jira/browse/SPARK-28760 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming, Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > At the moment no end-to-end Kafka delegation token test exists which was > mainly because of missing KDC. KDC is missing in general from the testing > side so I've discovered what kind of possibilities are there. The most > obvious choice is the MiniKDC inside the Hadoop library where Apache Kerby > runs in the background. In this jira I would like to add Kerby to the testing > area and use it to cover security related features. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28760) Add end-to-end Kafka delegation token test
[ https://issues.apache.org/jira/browse/SPARK-28760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-28760: -- Assignee: Gabor Somogyi > Add end-to-end Kafka delegation token test > -- > > Key: SPARK-28760 > URL: https://issues.apache.org/jira/browse/SPARK-28760 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming, Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > > At the moment no end-to-end Kafka delegation token test exists which was > mainly because of missing KDC. KDC is missing in general from the testing > side so I've discovered what kind of possibilities are there. The most > obvious choice is the MiniKDC inside the Hadoop library where Apache Kerby > runs in the background. In this jira I would like to add Kerby to the testing > area and use it to cover security related features. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25874) Simplify abstractions in the K8S backend
[ https://issues.apache.org/jira/browse/SPARK-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918813#comment-16918813 ] Marcelo Vanzin commented on SPARK-25874: I'm going to close this one; the only remaining task is for adding (internal developer) documentation, which is minor. > Simplify abstractions in the K8S backend > > > Key: SPARK-25874 > URL: https://issues.apache.org/jira/browse/SPARK-25874 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > I spent some time recently re-familiarizing myself with the k8s backend, and > I think there is room for improvement. In the past, SPARK-22839 was added > which improved things a lot, but it is still hard to follow the code, and it > is still more complicated than it should be to add a new feature. > I've worked on the main things that were bothering me and came up with these > changes: > https://github.com/vanzin/spark/commits/k8s-simple > Now that patch (first commit of the branch) is a little large, which makes it > hard to review and to properly assess what it is doing. So I plan to break it > down into a few steps that I will file as sub-tasks and send for review > independently. > The commit message for that patch has a lot of the background of what I > changed and why. Since I plan to delete that branch after the work is done, > I'll paste it here: > {noformat} > There are two main changes happening here. > (1) Simplify the KubernetesConf abstraction. > The current code around KubernetesConf has a few drawbacks: > - it uses composition (with a type parameter) for role-specific configuration > - it breaks encapsulation of the user configuration, held in SparkConf, by > requiring that all the k8s-specific info is extracted from SparkConf before > the KubernetesConf object is created. 
> - the above is usually done by parsing the SparkConf info into k8s-backend > types, which are then transformed into k8s requests. > This ends up requiring a whole lot of code that is just not necessary. > The type parameters make parameter and class declarations full of needless > noise; the breakage of encapsulation makes the code that processes SparkConf > and the code that builds the k8s descriptors live in different places, and > the intermediate representation isn't adding much value. > By using inheritance instead of the current model, role-specific > specialization of certain config properties works simply by implementing some > abstract methods of the base class (instead of breaking encapsulation), and > there's no need anymore for parameterized types. > By moving config processing to the code that actually transforms the config > into k8s descriptors, a lot of intermediate boilerplate can be removed. > This leads to... > (2) Make all feature logic part of the feature step itself. > Currently there's code in a lot of places to decide whether a feature > should be enabled. There's code when parsing the configuration, building > the custom intermediate representation in a way that is later used by > different code in a builder class, which then decides whether feature A > or feature B should be used. > Instead, it's much cleaner to let a feature decide things for itself. > If the config to enable feature A exists, it processes the config and > sets up the necessary k8s descriptors. If it doesn't, the feature is > a no-op. > This simplifies the shared code that calls into the existing features > a lot. And does not make the existing features any more complicated. > As part of this I merged the different language binding feature steps > into a single step. They are sort of related, in the sense that if > one is applied the others shouldn't, and merging them makes the logic > to implement that cleaner.
> The driver and executor builders are now also much simpler, since they > have no logic about what steps to apply or not. The tests were removed > because of that, and some new tests were added to the suites for > specific features, to verify what the old builder suites were testing. > On top of the above I made a few minor changes (in comparison): > - KubernetesVolumeUtils was modified to just throw exceptions. The old > code tried to avoid throwing exceptions by collecting results in `Try` > objects. That was not achieving anything since all the callers would > just call `get` on those objects, and the first one with a failure > would just throw the exception. The new code achieves the same > behavior and is simpler. > - A bunch of small things, mainly to bring the code in line with the > usual Spark code style. I also removed unnecessary mocking in tests, > unused imports, and unused configs and constants. > - Added some basic tests for
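The two changes described in the commit message can be sketched roughly as follows. This is an illustrative outline only: the class names, method names, and the config key are invented for the sketch and are not Spark's actual internals.

{code:scala}
import scala.collection.mutable

// (1) Inheritance instead of a type-parameterized KubernetesConf:
// role-specific values come from abstract methods on the base class,
// so SparkConf processing stays encapsulated and no intermediate
// representation is needed.
abstract class KubernetesConf(conf: Map[String, String]) {
  def resourceNamePrefix: String                  // specialized per role
  def get(key: String): Option[String] = conf.get(key)
}

class DriverConf(conf: Map[String, String]) extends KubernetesConf(conf) {
  override def resourceNamePrefix: String = "driver"
}

// (2) Each feature step decides for itself whether it applies:
// if its config is absent, configurePod is simply a no-op.
trait FeatureStep {
  def configurePod(pod: mutable.Map[String, String]): Unit
}

class MountSecretStep(conf: KubernetesConf) extends FeatureStep {
  override def configurePod(pod: mutable.Map[String, String]): Unit =
    // "spark.kubernetes.driver.secret" is a made-up key for illustration
    conf.get("spark.kubernetes.driver.secret").foreach(s => pod("secret-volume") = s)
}
{code}

Under this shape, the builder can apply every step unconditionally; the no-op behavior of unconfigured steps replaces the old enable/disable logic scattered across the builder classes.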
[jira] [Resolved] (SPARK-25874) Simplify abstractions in the K8S backend
[ https://issues.apache.org/jira/browse/SPARK-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25874. Fix Version/s: 3.0.0 Resolution: Done > Simplify abstractions in the K8S backend > > > Key: SPARK-25874 > URL: https://issues.apache.org/jira/browse/SPARK-25874 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > Fix For: 3.0.0 > > > I spent some time recently re-familiarizing myself with the k8s backend, and > I think there is room for improvement. In the past, SPARK-22839 was added > which improved things a lot, but it is still hard to follow the code, and it > is still more complicated than it should be to add a new feature. > I've worked on the main things that were bothering me and came up with these > changes: > https://github.com/vanzin/spark/commits/k8s-simple > Now that patch (first commit of the branch) is a little large, which makes it > hard to review and to properly assess what it is doing. So I plan to break it > down into a few steps that I will file as sub-tasks and send for review > independently. > The commit message for that patch has a lot of the background of what I > changed and why. Since I plan to delete that branch after the work is done, > I'll paste it here: > {noformat} > There are two main changes happening here. > (1) Simplify the KubernetesConf abstraction. > The current code around KubernetesConf has a few drawbacks: > - it uses composition (with a type parameter) for role-specific configuration > - it breaks encapsulation of the user configuration, held in SparkConf, by > requiring that all the k8s-specific info is extracted from SparkConf before > the KubernetesConf object is created. > - the above is usually done by parsing the SparkConf info into k8s-backend > types, which are then transformed into k8s requests. > This ends up requiring a whole lot of code that is just not necessary. 
> The type parameters make parameter and class declarations full of needless > noise; the breakage of encapsulation makes the code that processes SparkConf > and the code that builds the k8s descriptors live in different places, and > the intermediate representation isn't adding much value. > By using inheritance instead of the current model, role-specific > specialization of certain config properties works simply by implementing some > abstract methods of the base class (instead of breaking encapsulation), and > there's no need anymore for parameterized types. > By moving config processing to the code that actually transforms the config > into k8s descriptors, a lot of intermediate boilerplate can be removed. > This leads to... > (2) Make all feature logic part of the feature step itself. > Currently there's code in a lot of places to decide whether a feature > should be enabled. There's code when parsing the configuration, building > the custom intermediate representation in a way that is later used by > different code in a builder class, which then decides whether feature A > or feature B should be used. > Instead, it's much cleaner to let a feature decide things for itself. > If the config to enable feature A exists, it processes the config and > sets up the necessary k8s descriptors. If it doesn't, the feature is > a no-op. > This simplifies the shared code that calls into the existing features > a lot. And does not make the existing features any more complicated. > As part of this I merged the different language binding feature steps > into a single step. They are sort of related, in the sense that if > one is applied the others shouldn't, and merging them makes the logic > to implement that cleaner. > The driver and executor builders are now also much simpler, since they > have no logic about what steps to apply or not.
The tests were removed > because of that, and some new tests were added to the suites for > specific features, to verify what the old builder suites were testing. > On top of the above I made a few minor changes (in comparison): > - KubernetesVolumeUtils was modified to just throw exceptions. The old > code tried to avoid throwing exceptions by collecting results in `Try` > objects. That was not achieving anything since all the callers would > just call `get` on those objects, and the first one with a failure > would just throw the exception. The new code achieves the same > behavior and is simpler. > - A bunch of small things, mainly to bring the code in line with the > usual Spark code style. I also removed unnecessary mocking in tests, > unused imports, and unused configs and constants. > - Added some basic tests for KerberosConfDriverFeatureStep. > Note that there may still be leftover intermediate
[jira] [Resolved] (SPARK-27611) Redundant javax.activation dependencies in the Maven build
[ https://issues.apache.org/jira/browse/SPARK-27611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27611. --- Fix Version/s: 3.0.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/24507 > Redundant javax.activation dependencies in the Maven build > -- > > Key: SPARK-27611 > URL: https://issues.apache.org/jira/browse/SPARK-27611 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 3.0.0 > > > [PR #23890|https://github.com/apache/spark/pull/23890] introduced > {{org.glassfish.jaxb:jaxb-runtime:2.3.2}} as a runtime dependency. As an > unexpected side effect, {{jakarta.activation:jakarta.activation-api:1.2.1}} > was also pulled in as a transitive dependency. As a result, for the Maven > build, both of the following two jars can be found under > {{assembly/target/scala-2.12/jars}}: > {noformat} > activation-1.1.1.jar > jakarta.activation-api-1.2.1.jar > {noformat} > Discussed this with [~srowen] offline and we agreed that we should probably > exclude the Jakarta one. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28807) Document SHOW DATABASES in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-28807. - Fix Version/s: 3.0.0 Assignee: Dilip Biswal Resolution: Fixed > Document SHOW DATABASES in SQL Reference. > - > > Key: SPARK-28807 > URL: https://issues.apache.org/jira/browse/SPARK-28807 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28917) Jobs can hang because of race of RDD.dependencies
Imran Rashid created SPARK-28917: Summary: Jobs can hang because of race of RDD.dependencies Key: SPARK-28917 URL: https://issues.apache.org/jira/browse/SPARK-28917 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.4.3, 2.3.3 Reporter: Imran Rashid {{RDD.dependencies}} stores the precomputed cache value, but it is not thread-safe. This can lead to a race where the value gets overwritten, but the DAGScheduler gets stuck in an inconsistent state. In particular, this can happen when there is a race between the DAGScheduler event loop, and another thread (eg. a user thread, if there is multi-threaded job submission). First, a job is submitted by the user, which then computes the result Stage and its parents: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983 Which eventually makes a call to {{rdd.dependencies}}: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519 At the same time, the user could also touch {{rdd.dependencies}} in another thread, which could overwrite the stored value because of the race. Then the DAGScheduler checks the dependencies *again* later on in the job submission, via {{getMissingParentStages}} https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025 Because it will find new dependencies, it will create entirely different stages. Now the job has some orphaned stages which will never run. The symptoms of this are seeing disjoint sets of stages in the "Parents of final stage" and the "Missing parents" messages on job submission, as well as seeing repeated messages "Registered RDD X" for the same RDD id. eg: {noformat} [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - Starting job: count at XXX.scala:462 ... 
[INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421) ... ... [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 5 (count at XXX.scala:462) [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List(ShuffleMapStage 4) [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421) [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Missing parents: List(ShuffleMapStage 6) {noformat} Note that there is a similar issue w/ {{rdd.partitions}}. I don't see a way it could mess up the scheduler (seems it's only used for {{rdd.partitions.length}}). There is also an issue that {{rdd.storageLevel}} is read and cached in the scheduler, but it could be modified simultaneously by the user in another thread. Similarly, I can't see a way it could affect the scheduler. *WORKAROUND*: (a) call {{rdd.dependencies}} while you know that RDD is only getting touched by one thread (eg. in the thread that created it, or before you submit multiple jobs touching that RDD from other threads). Then that value will get cached. (b) don't submit jobs from multiple threads. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
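Workaround (a) can be illustrated with a short sketch. The job itself is hypothetical; the essential point is that {{rdd.dependencies}} is invoked once while only a single thread owns the RDD, so the cached value is populated before any concurrent access.

{code:scala}
// Hypothetical sketch of workaround (a): force the lazily cached value of
// rdd.dependencies to be computed while only this thread touches the RDD.
val rdd = sc.parallelize(1 to 1000).map(_ * 2).repartition(8)
rdd.dependencies   // populates the cached dependency list single-threaded
// Only now hand `rdd` to other threads for concurrent job submission.
{code}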
[jira] [Resolved] (SPARK-28786) Document INSERT statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-28786. - Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed > Document INSERT statement in SQL Reference. > --- > > Key: SPARK-28786 > URL: https://issues.apache.org/jira/browse/SPARK-28786 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28917) Jobs can hang because of race of RDD.dependencies
[ https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918764#comment-16918764 ] Imran Rashid commented on SPARK-28917: -- [~markhamstra] [~jiangxb1987] [~tgraves] [~Ngone51] would appreciate your thoughts on this. I think the bug I've described above is pretty clear. However, the part which I'm wondering about a bit more is whether there is more mutability in RDD that could cause problems. In the one case of this I have, I only know for sure that the user is calling {{rdd.cache()}} in another thread. But I can't see how that would lead to the symptoms I describe above. I don't know that they are doing anything in their user thread which would touch {{rdd.dependencies}}, but I also don't have full visibility into everything they are doing, so this still seems like the best explanation to me. > Jobs can hang because of race of RDD.dependencies > - > > Key: SPARK-28917 > URL: https://issues.apache.org/jira/browse/SPARK-28917 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.3 >Reporter: Imran Rashid >Priority: Major > > {{RDD.dependencies}} stores the precomputed cache value, but it is not > thread-safe. This can lead to a race where the value gets overwritten, but > the DAGScheduler gets stuck in an inconsistent state. In particular, this > can happen when there is a race between the DAGScheduler event loop, and > another thread (eg. a user thread, if there is multi-threaded job submission). 
> First, a job is submitted by the user, which then computes the result Stage > and its parents: > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983 > Which eventually makes a call to {{rdd.dependencies}}: > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519 > At the same time, the user could also touch {{rdd.dependencies}} in another > thread, which could overwrite the stored value because of the race. > Then the DAGScheduler checks the dependencies *again* later on in the job > submission, via {{getMissingParentStages}} > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025 > Because it will find new dependencies, it will create entirely different > stages. Now the job has some orphaned stages which will never run. > The symptoms of this are seeing disjoint sets of stages in the "Parents of > final stage" and the "Missing parents" messages on job submission, as well as > seeing repeated messages "Registered RDD X" for the same RDD id. eg: > {noformat} > [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - > Starting job: count at XXX.scala:462 > ... > [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler > logInfo - Registering RDD 14 (repartition at XXX.scala:421) > ... > ... 
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Final stage: ResultStage 5 (count at XXX.scala:462) > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Parents of final stage: List(ShuffleMapStage 4) > [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler > logInfo - Registering RDD 14 (repartition at XXX.scala:421) > [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler > logInfo - Missing parents: List(ShuffleMapStage 6) > {noformat} > Note that there is a similar issue w/ {{rdd.partitions}}. I don't see a way > it could mess up the scheduler (seems its only used for > {{rdd.partitions.length}}). There is also an issue that {{rdd.storageLevel}} > is read and cached in the scheduler, but it could be modified simultaneously > by the user in another thread. Similarly, I can't see a way it could effect > the scheduler. > *WORKAROUND*: > (a) call {{rdd.dependencies}} while you know that RDD is only getting touched > by one thread (eg. in the thread that created it, or before you submit > multiple jobs touching that RDD from other threads). Then that value will get > cached. > (b) don't submit jobs from multiple threads. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL
MOBIN created SPARK-28916: - Summary: Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL Key: SPARK-28916 URL: https://issues.apache.org/jira/browse/SPARK-28916 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 2.3.1 Reporter: MOBIN Can be reproduced by the following steps: 1. Create a table with 5000 fields 2. val data=spark.sql("select * from spark64kb limit 10"); 3. data.describe() Then,The following error occurred {code:java} WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, executor 1): org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection" grows beyond 64 KB at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373) at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44) at org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385) at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96) at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95) at org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40) at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86) at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection" grows beyond 64 KB at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) at
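The reproduction steps above can be condensed into a short sketch. This is illustrative only: {{spark64kb}} is the table name from the report, and the 5000-column table is built inline here rather than pre-created in the metastore.

{code:scala}
import org.apache.spark.sql.functions.lit

// Build a view with 5000 columns, then run describe() over a SELECT on it.
// With this many fields, the generated SpecificSafeProjection.apply method
// may exceed the JVM's 64 KB bytecode limit on affected versions.
val wide = spark.range(10).select((0 until 5000).map(i => lit(i).as(s"c$i")): _*)
wide.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
data.describe()
{code}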
[jira] [Updated] (SPARK-28915) The new keyword is not used when instantiating the WorkerOffer.
[ https://issues.apache.org/jira/browse/SPARK-28915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaqi Li updated SPARK-28915: - Description: WorkerOffer is a case class, so we don't need to use new when instantiating it, which makes code more concise. {code:java} private[spark] case class WorkerOffer( executorId: String, host: String, cores: Int, address: Option[String] = None, resources: Map[String, Buffer[String]] = Map.empty) {code} was:`WorkerOffer` is a case class, so we don't need to use `new` when instantiating it, which makes code more concise. > The new keyword is not used when instantiating the WorkerOffer. > --- > > Key: SPARK-28915 > URL: https://issues.apache.org/jira/browse/SPARK-28915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Jiaqi Li >Priority: Minor > > WorkerOffer is a case class, so we don't need to use new when instantiating > it, which makes code more concise. > {code:java} > private[spark] > case class WorkerOffer( > executorId: String, > host: String, > cores: Int, > address: Option[String] = None, > resources: Map[String, Buffer[String]] = Map.empty) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
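The change is purely cosmetic: a case class's companion object provides a synthesized {{apply}}, so both forms below construct the same object. The snippet uses a simplified three-field {{WorkerOffer}} and made-up values for illustration.

{code:scala}
case class WorkerOffer(executorId: String, host: String, cores: Int)

val a = WorkerOffer("exec-1", "host-1", 4)      // preferred: companion apply
val b = new WorkerOffer("exec-1", "host-1", 4)  // legal, but the `new` is noise
assert(a == b)  // case classes compare structurally, so both are equal
{code}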
[jira] [Created] (SPARK-28915) The new keyword is not used when instantiating the WorkerOffer.
Jiaqi Li created SPARK-28915: Summary: The new keyword is not used when instantiating the WorkerOffer. Key: SPARK-28915 URL: https://issues.apache.org/jira/browse/SPARK-28915 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3 Reporter: Jiaqi Li `WorkerOffer` is a case class, so we don't need to use `new` when instantiating it, which makes code more concise. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28914) GMM Returning Clusters With Equal Probability
Nirek Sharma created SPARK-28914: Summary: GMM Returning Clusters With Equal Probability Key: SPARK-28914 URL: https://issues.apache.org/jira/browse/SPARK-28914 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 2.4.3 Environment: https://gist.github.com/sharmanirek/172c9408e8393462ae54dfae83764413 Test Data & notebook: https://drive.google.com/open?id=1YEBsUJv9p2XaTQzsKI9GxLKVD7oe1tan Reporter: Nirek Sharma Fitting and Transforming with a GMM yields all categories with equal probability. See the below gist demonstrating the code to reproduce the problem. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Description: The stack trace is below: {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 BlockManager: Block rdd_10916_493 could not be removed as it was not found on disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) java.lang.ArrayIndexOutOfBoundsException: 6741 at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This exception happened sometimes. And we also found that the AUC metric was not stable when evaluating the inner product of the user factors and the item factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 which was not stable for production environment. 
Dataset capacity: ~12 billion ratings. Here is our code: {code:java} val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) .groupByKey().zipWithIndex() .persist(StorageLevel.MEMORY_AND_DISK_SER) val predataUser = predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2 .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ b).map(r=>(r._1,r._2.toIterable)) .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER) //x._2 is the item_id, y._1 is the user_id, y._2 is the rating val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, y._2.toFloat))) .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER) case class ALSData(user:Int, item:Int, rating:Float) extends Serializable val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() val als = new ALS val paramMap = ParamMap(als.alpha -> 25000). put(als.checkpointInterval, 5). put(als.implicitPrefs, true). put(als.itemCol, "item"). put(als.maxIter, 60). put(als.nonnegative, false).
[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Summary: ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances (was: ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances) > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > > > Key: SPARK-28913 > URL: https://issues.apache.org/jira/browse/SPARK-28913 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. > Dataset capacity: ~12 billion ratings > Here is our code: > {code:java} > val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) > val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) > .groupByKey().zipWithIndex() > .persist(StorageLevel.MEMORY_AND_DISK_SER) > val predataUser = > predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2 > .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ > b).map(r=>(r._1,r._2.toIterable)) > .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER) > //x._2 is
[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Description: The stack trace is below: {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 BlockManager: Block rdd_10916_493 could not be removed as it was not found on disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) java.lang.ArrayIndexOutOfBoundsException: 6741 at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This exception happened sometimes. 
Dataset capacity: ~12 billion ratings. Here is our code: {code:java} val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) .groupByKey().zipWithIndex() .persist(StorageLevel.MEMORY_AND_DISK_SER) val predataUser = predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2 .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ b).map(r=>(r._1,r._2.toIterable)) .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER) //x._2 is the item_id, y._1 is the user_id, y._2 is the rating val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, y._2.toFloat))) .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER) case class ALSData(user:Int, item:Int, rating:Float) extends Serializable val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() val als = new ALS val paramMap = ParamMap(als.alpha -> 25000). put(als.checkpointInterval, 5). put(als.implicitPrefs, true). put(als.itemCol, "item"). put(als.maxIter, 60). put(als.nonnegative, false). put(als.numItemBlocks, 600). put(als.numUserBlocks, 600). put(als.regParam, 4.5). put(als.rank, 25). put(als.userCol, "user") als.fit(ratingData, paramMap) {code} was: The stack trace is below: {quote}19/08/28 07:00:40 WARN
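A minimal sketch of one plausible failure mode (an assumption for illustration, not a confirmed diagnosis of this ticket): the pipeline above builds user and item ids with `zipWithIndex` on the output of `groupByKey`, which does not guarantee a deterministic ordering. If a partition is lost and recomputed, `zipWithIndex` can hand the same record a different index than the one already captured in `trainData`, so ALS's in-block factor arrays end up addressed with a stale, out-of-range id. Plain Scala, with the hypothetical `assignIds` standing in for `zipWithIndex` over one partition:

```scala
// Sketch only: `assignIds` stands in for zipWithIndex over one partition.
object ZipWithIndexSketch {
  def assignIds[A](records: Seq[A]): Map[A, Long] =
    records.zipWithIndex.map { case (r, i) => r -> i.toLong }.toMap

  def main(args: Array[String]): Unit = {
    val firstComputation = Seq("u1", "u2", "u3")
    // Same records, different order, as a recomputed non-deterministic
    // parent (e.g. groupByKey output) might produce them.
    val recomputation = Seq("u2", "u3", "u1")

    val ids1 = assignIds(firstComputation)
    val ids2 = assignIds(recomputation)
    // "u1" now has a different id than the one captured earlier; any array
    // sized or addressed with the old ids can go out of bounds.
    println(ids1("u1") == ids2("u1")) // prints false
  }
}
```

Persisting (or checkpointing) the indexed RDDs before the ids are consumed, so that a recomputation replays the same data in the same order, is the usual way to keep such ids stable.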
[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Description: The stack trace is below: {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 BlockManager: Block rdd_10916_493 could not be removed as it was not found on disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) java.lang.ArrayIndexOutOfBoundsException: 6741 at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This happened sometimes. 
Dataset capacity: ~12 billion ratings. Here is our code: {code:java} val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) .groupByKey().zipWithIndex() .persist(StorageLevel.MEMORY_AND_DISK_SER) val predataUser = predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2 .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ b).map(r=>(r._1,r._2.toIterable)) .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER) //x._2 is the item_id, y._1 is the user_id, y._2 is the rating val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, y._2.toFloat))) .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER) case class ALSData(user:Int, item:Int, rating:Float) extends Serializable val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() val als = new ALS val paramMap = ParamMap(als.alpha -> 25000). put(als.checkpointInterval, 5). put(als.implicitPrefs, true). put(als.itemCol, "item"). put(als.maxIter, 60). put(als.nonnegative, false). put(als.numItemBlocks, 600). put(als.numUserBlocks, 600). put(als.regParam, 4.5). put(als.rank, 25). put(als.userCol, "user") als.fit(ratingData, paramMap) {code} was: The stack trace is below: {quote}19/08/28 07:00:40 WARN Executor task
[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Summary: ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances (was: ArrayIndexOutOfBoundsException in ALS for Large datasets) > ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances > - > > Key: SPARK-28913 > URL: https://issues.apache.org/jira/browse/SPARK-28913 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745){quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Description: The stack trace is below: {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 BlockManager: Block rdd_10916_493 could not be removed as it was not found on disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) java.lang.ArrayIndexOutOfBoundsException: 6741 at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){quote} This happened after the dataset was sub-sampled. 
Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances was: The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Summary: ArrayIndexOutOfBoundsException in ALS for Large datasets (was: CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets) > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-28913 > URL: https://issues.apache.org/jira/browse/SPARK-28913 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances
[jira] [Created] (SPARK-28913) CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets
Qiang Wang created SPARK-28913: -- Summary: CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-28913 URL: https://issues.apache.org/jira/browse/SPARK-28913 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0, 1.2.0 Reporter: Qiang Wang Assignee: Xiangrui Meng The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) 
org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances
[jira] [Updated] (SPARK-28913) CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28913: --- Affects Version/s: (was: 1.2.0) (was: 1.1.0) 2.2.1 > CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-28913 > URL: https://issues.apache.org/jira/browse/SPARK-28913 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28912) MatchError exception in CheckpointWriteHandler
Aleksandr Kashkirov created SPARK-28912: --- Summary: MatchError exception in CheckpointWriteHandler Key: SPARK-28912 URL: https://issues.apache.org/jira/browse/SPARK-28912 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2, 2.3.0 Reporter: Aleksandr Kashkirov Setting checkpoint directory name to "checkpoint-" plus some digits (e.g. "checkpoint-01") results in the following error: {code:java} Exception in thread "pool-32-thread-1" scala.MatchError: 0523a434-0daa-4ea6-a050-c4eb3c557d8c (of class java.lang.String) at org.apache.spark.streaming.Checkpoint$.org$apache$spark$streaming$Checkpoint$$sortFunc$1(Checkpoint.scala:121) at org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132) at org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132) at scala.math.Ordering$$anon$9.compare(Ordering.scala:200) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) at java.util.TimSort.sort(TimSort.java:234) at java.util.Arrays.sort(Arrays.java:1438) at scala.collection.SeqLike$class.sorted(SeqLike.scala:648) at scala.collection.mutable.ArrayOps$ofRef.sorted(ArrayOps.scala:186) at scala.collection.SeqLike$class.sortWith(SeqLike.scala:601) at scala.collection.mutable.ArrayOps$ofRef.sortWith(ArrayOps.scala:186) at org.apache.spark.streaming.Checkpoint$.getCheckpointFiles(Checkpoint.scala:132) at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
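A minimal Python reproduction of the general failure mode (the regex and file names below are illustrative, not Spark's actual Checkpoint.scala code): the filter that selects candidate checkpoint files is looser than the pattern the sort function destructures, so a temp file with a UUID name under a directory called "checkpoint-01" passes the filter but makes the sort key fail, analogous to the scala.MatchError above.

```python
import re

# Illustrative stand-in for Checkpoint.getCheckpointFiles / sortFunc.
CHECKPOINT_RE = re.compile(r"checkpoint-(\d+)(\.bk)?$")

def sort_key(path):
    m = CHECKPOINT_RE.search(path)
    if m is None:
        # Analogue of the scala.MatchError on the UUID-named temp file
        raise ValueError(f"unexpected checkpoint file name: {path}")
    return (int(m.group(1)), m.group(2) is not None)

def get_checkpoint_files(paths):
    # Loose filter: any path containing "checkpoint-" slips through,
    # including temp files under a directory *named* "checkpoint-01".
    candidates = [p for p in paths if "checkpoint-" in p]
    return sorted(candidates, key=sort_key)

paths = ["checkpoint-01/checkpoint-1000",
         "checkpoint-01/0523a434-0daa-4ea6-a050-c4eb3c557d8c"]
try:
    get_checkpoint_files(paths)
except ValueError as e:
    print("sorting failed:", e)
```

The fix would be to make the filter and the sort pattern agree (or to skip names that do not match instead of raising).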
[jira] [Updated] (SPARK-28911) Unify Kafka source option pattern
[ https://issues.apache.org/jira/browse/SPARK-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan updated SPARK-28911: Description: Data source option names are camelCase (e.g. checkpointLocation), but some Kafka source options are dot-separated (e.g. fetchOffset.numRetries). Unifying the pattern would also make it easier to distinguish pass-through Kafka client options, such as kafka.bootstrap.servers, from the source's own options. > Unify Kafka source option pattern > - > > Key: SPARK-28911 > URL: https://issues.apache.org/jira/browse/SPARK-28911 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Major > > Data source option names are camelCase (e.g. checkpointLocation), but some > Kafka source options are dot-separated (e.g. fetchOffset.numRetries). > Unifying the pattern would also make it easier to distinguish pass-through > Kafka client options, such as kafka.bootstrap.servers, from the source's own > options.
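For context, this is roughly how the split works in the Kafka source today (a simplified Python sketch, not the actual Scala implementation): options whose keys start with kafka. are forwarded to the Kafka client with the prefix stripped, and everything else is interpreted by the source itself.

```python
def split_kafka_options(options):
    """Partition DataSource options: keys prefixed 'kafka.' go to the
    Kafka consumer/producer config (prefix stripped); the rest are
    interpreted by the source itself."""
    kafka_params, source_options = {}, {}
    for key, value in options.items():
        if key.lower().startswith("kafka."):
            kafka_params[key[len("kafka."):]] = value
        else:
            source_options[key] = value
    return kafka_params, source_options

kafka, source = split_kafka_options({
    "kafka.bootstrap.servers": "host:9092",
    "subscribe": "events",
    "fetchOffset.numRetries": "3",
})
print(kafka)   # {'bootstrap.servers': 'host:9092'}
print(source)  # {'subscribe': 'events', 'fetchOffset.numRetries': '3'}
```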
[jira] [Created] (SPARK-28911) Unify Kafka source option pattern
wenxuanguan created SPARK-28911: --- Summary: Unify Kafka source option pattern Key: SPARK-28911 URL: https://issues.apache.org/jira/browse/SPARK-28911 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: wenxuanguan
[jira] [Created] (SPARK-28910) Prevent schema verification when connecting to in memory derby
Juliusz Sompolski created SPARK-28910: - Summary: Prevent schema verification when connecting to in memory derby Key: SPARK-28910 URL: https://issues.apache.org/jira/browse/SPARK-28910 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3 Reporter: Juliusz Sompolski When hive.metastore.schema.verification=true, HiveUtils.newClientForExecution fails with {{19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should not accessed in runtime. org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290) at org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58) ... Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient}} This prevents Thriftserver from starting -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
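Until this is fixed, users hitting the failure can work around it by disabling the check via the standard Hive property, e.g. in hive-site.xml (this only relaxes verification globally; it does not address the underlying issue of Spark's execution-local client performing the check at all):

```xml
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
```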
[jira] [Created] (SPARK-28909) ANSI SQL: delete and update does not support in Spark
ABHISHEK KUMAR GUPTA created SPARK-28909: Summary: ANSI SQL: delete and update does not support in Spark Key: SPARK-28909 URL: https://issues.apache.org/jira/browse/SPARK-28909 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA DELETE and UPDATE are supported in PostgreSQL:
{code:sql}
create table emp_test(id int);
insert into emp_test values(100);
insert into emp_test values(200);
select * from emp_test;
delete from emp_test where id=100;
select * from emp_test;
update emp_test set id=500 where id=200;
select * from emp_test;
{code}
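The PostgreSQL session above can be replayed with any engine that implements DELETE and UPDATE; here is a self-contained Python/sqlite3 version of the same steps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp_test (id INTEGER)")
cur.executemany("INSERT INTO emp_test VALUES (?)", [(100,), (200,)])

cur.execute("DELETE FROM emp_test WHERE id = 100")
cur.execute("UPDATE emp_test SET id = 500 WHERE id = 200")

cur.execute("SELECT id FROM emp_test")
print(cur.fetchall())  # [(500,)]
```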
[jira] [Commented] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException
[ https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918386#comment-16918386 ] Saisai Shao commented on SPARK-28340: - My simple concern is that there may be other places which will potentially throw this "ClosedByInterruptException" during task killing, it seems hard to figure out all of them. > Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught > exception while reverting partial writes to file: > java.nio.channels.ClosedByInterruptException" > > > Key: SPARK-28340 > URL: https://issues.apache.org/jira/browse/SPARK-28340 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Minor > > If a Spark task is killed while writing blocks to disk (due to intentional > job kills, automated killing of redundant speculative tasks, etc) then Spark > may log exceptions like > {code:java} > 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception > while reverting partial writes to file / > java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372) > at > org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369) > at > org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled > task can result in hundreds of these stacktraces being logged. > Here are some StackOverflow questions asking about this: > * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash] > * > [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple] > * > [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark] > * > [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills] > > Can we prevent this exception from occurring? If not, can we treat this > "expected exception" in a special manner to avoid log spam? My concern is > that the presence of large numbers of spurious exceptions is confusing to > users when they are inspecting Spark logs to diagnose other issues. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics
[ https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan updated SPARK-28908: Affects Version/s: (was: 2.4.3) 3.0.0 > Structured Streaming Kafka sink support Exactly-Once semantics > -- > > Key: SPARK-28908 > URL: https://issues.apache.org/jira/browse/SPARK-28908 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Major > Attachments: Kafka Sink Exactly-Once Semantics Design Sketch.pdf > > > Since Apache Kafka supports transaction from 0.11.0.0, we can implement Kafka > sink exactly-once semantics with transaction Kafka producer -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics
[ https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan updated SPARK-28908: Attachment: Kafka Sink Exactly-Once Semantics Design Sketch.pdf > Structured Streaming Kafka sink support Exactly-Once semantics > -- > > Key: SPARK-28908 > URL: https://issues.apache.org/jira/browse/SPARK-28908 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: wenxuanguan >Priority: Major > Attachments: Kafka Sink Exactly-Once Semantics Design Sketch.pdf > > > Since Apache Kafka supports transaction from 0.11.0.0, we can implement Kafka > sink exactly-once semantics with transaction Kafka producer -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics
[ https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan updated SPARK-28908: Attachment: Kafka Sink Exactly-Once Semantics Design Sketch.pdf > Structured Streaming Kafka sink support Exactly-Once semantics > -- > > Key: SPARK-28908 > URL: https://issues.apache.org/jira/browse/SPARK-28908 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: wenxuanguan >Priority: Major > > Since Apache Kafka supports transaction from 0.11.0.0, we can implement Kafka > sink exactly-once semantics with transaction Kafka producer -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics
[ https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan updated SPARK-28908: Attachment: (was: Kafka Sink Exactly-Once Semantics Design Sketch.pdf) > Structured Streaming Kafka sink support Exactly-Once semantics > -- > > Key: SPARK-28908 > URL: https://issues.apache.org/jira/browse/SPARK-28908 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: wenxuanguan >Priority: Major > > Since Apache Kafka supports transaction from 0.11.0.0, we can implement Kafka > sink exactly-once semantics with transaction Kafka producer -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics
wenxuanguan created SPARK-28908: --- Summary: Structured Streaming Kafka sink support Exactly-Once semantics Key: SPARK-28908 URL: https://issues.apache.org/jira/browse/SPARK-28908 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: wenxuanguan
[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics
[ https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan updated SPARK-28908: Description: Since Apache Kafka has supported transactions since 0.11.0.0, we can implement exactly-once semantics in the Kafka sink with a transactional Kafka producer > Structured Streaming Kafka sink support Exactly-Once semantics > -- > > Key: SPARK-28908 > URL: https://issues.apache.org/jira/browse/SPARK-28908 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: wenxuanguan >Priority: Major > > Since Apache Kafka has supported transactions since 0.11.0.0, we can > implement exactly-once semantics in the Kafka sink with a transactional Kafka > producer
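A toy sketch of the core idea (a mock producer, not the real Kafka transactions API; the attached design sketch presumably also has to handle re-executed batches, which this ignores): records staged inside a transaction become visible only if the whole batch commits, so a failed batch leaves nothing behind.

```python
class MockTransactionalProducer:
    """Toy stand-in for a transactional Kafka producer: staged records
    become visible to a read-committed consumer only on commit."""
    def __init__(self):
        self.committed = []   # what a read-committed consumer would see
        self._pending = []

    def begin_transaction(self):
        self._pending = []

    def send(self, record):
        self._pending.append(record)

    def commit_transaction(self):
        self.committed.extend(self._pending)
        self._pending = []

    def abort_transaction(self):
        self._pending = []

def write_batch(producer, rows):
    # All-or-nothing per micro-batch: commit only after every row is staged.
    producer.begin_transaction()
    try:
        for row in rows:
            producer.send(row)
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
        raise

def failing_rows():
    yield "c"
    raise RuntimeError("task failed mid-batch")

p = MockTransactionalProducer()
write_batch(p, ["a", "b"])
try:
    write_batch(p, failing_rows())
except RuntimeError:
    pass
print(p.committed)  # ['a', 'b'] -- the failed batch is invisible
```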
[jira] [Commented] (SPARK-28678) Specify that start index is 1-based in docstring of pyspark.sql.functions.slice
[ https://issues.apache.org/jira/browse/SPARK-28678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918371#comment-16918371 ] Hyukjin Kwon commented on SPARK-28678: -- There's no need to assign anyone. If you open a PR, it will be assigned automatically. > Specify that start index is 1-based in docstring of > pyspark.sql.functions.slice > --- > > Key: SPARK-28678 > URL: https://issues.apache.org/jira/browse/SPARK-28678 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Sivam Pasupathipillai >Priority: Trivial > > The start index parameter in pyspark.sql.functions.slice should be 1-based; > otherwise the method fails with an exception. > This is not documented anywhere. Fix the docstring.
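For reference, the behavior to be documented (as I understand Spark's SQL slice semantics; the pure-Python model below is an illustration, not PySpark code): the start index is 1-based, negative values count from the end, and 0 is invalid.

```python
def one_based_slice(xs, start, length):
    """Model of SQL slice(x, start, length): start is 1-based;
    negative start counts from the end; start = 0 is rejected."""
    if start == 0:
        raise ValueError("SQL array indices start at 1 (or -1 from the end)")
    begin = start - 1 if start > 0 else len(xs) + start
    return xs[begin:begin + length]

print(one_based_slice([1, 2, 3, 4], 2, 2))   # [2, 3]
print(one_based_slice([1, 2, 3, 4], -2, 2))  # [3, 4]
```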
[jira] [Created] (SPARK-28907) Review invalid usage of new Configuration()
Xianjin YE created SPARK-28907: -- Summary: Review invalid usage of new Configuration() Key: SPARK-28907 URL: https://issues.apache.org/jira/browse/SPARK-28907 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3 Reporter: Xianjin YE As [~LI,Xiao] mentioned in [https://github.com/apache/spark/pull/25002#pullrequestreview-279392994] there may be other incorrect usages of new Configuration() that lead to unexpected results. This jira reviews those usages in Spark and replaces them where necessary.
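A toy Python illustration of the pitfall under review (the dicts stand in for Hadoop Configuration objects; all names are made up): constructing a fresh default configuration silently drops settings the user applied to the shared one, which is why call sites should thread the session's configuration through instead of calling new Configuration().

```python
# Settings the user applied (e.g. via spark.hadoop.* properties).
SHARED_HADOOP_CONF = {"fs.defaultFS": "hdfs://prod", "dfs.replication": "2"}

def fresh_config():
    # Analogue of `new Configuration()`: classpath defaults only.
    return {"fs.defaultFS": "file:///", "dfs.replication": "3"}

def resolve_fs_wrong():
    conf = fresh_config()          # user settings silently lost
    return conf["fs.defaultFS"]

def resolve_fs_right(shared_conf=SHARED_HADOOP_CONF):
    conf = dict(shared_conf)       # propagate the session's conf
    return conf["fs.defaultFS"]

print(resolve_fs_wrong())  # file:///
print(resolve_fs_right())  # hdfs://prod
```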
[jira] [Commented] (SPARK-28678) Specify that start index is 1-based in docstring of pyspark.sql.functions.slice
[ https://issues.apache.org/jira/browse/SPARK-28678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918341#comment-16918341 ] Ting Yang commented on SPARK-28678: --- I'd like to work on this; could you assign it to me? [~hyukjin.kwon] Thanks > Specify that start index is 1-based in docstring of > pyspark.sql.functions.slice > --- > > Key: SPARK-28678 > URL: https://issues.apache.org/jira/browse/SPARK-28678 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Sivam Pasupathipillai >Priority: Trivial > > The start index parameter in pyspark.sql.functions.slice should be 1-based; > otherwise the method fails with an exception. > This is not documented anywhere. Fix the docstring.
[jira] [Commented] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException
[ https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918306#comment-16918306 ] Saisai Shao commented on SPARK-28340: - We also saw a bunch of exceptions in our production environment. Looks like it is hard to prevent unless we change to not use `interrupt`, maybe we can just ignore logging such exceptions. > Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught > exception while reverting partial writes to file: > java.nio.channels.ClosedByInterruptException" > > > Key: SPARK-28340 > URL: https://issues.apache.org/jira/browse/SPARK-28340 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Minor > > If a Spark task is killed while writing blocks to disk (due to intentional > job kills, automated killing of redundant speculative tasks, etc) then Spark > may log exceptions like > {code:java} > 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception > while reverting partial writes to file / > java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372) > at > org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369) > at > org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled > task can result in hundreds of these stacktraces being logged. > Here are some StackOverflow questions asking about this: > * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash] > * > [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple] > * > [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark] > * > [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills] > > Can we prevent this exception from occurring? If not, can we treat this > "expected exception" in a special manner to avoid log spam? My concern is > that the presence of large numbers of spurious exceptions is confusing to > users when they are inspecting Spark logs to diagnose other issues. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org