[jira] [Resolved] (SPARK-27193) CodeFormatter should format multi comment lines correctly
[ https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27193. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24133 [https://github.com/apache/spark/pull/24133] > CodeFormatter should format multi comment lines correctly > - > > Key: SPARK-27193 > URL: https://issues.apache.org/jira/browse/SPARK-27193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: wuyi >Assignee: wuyi >Priority: Trivial > Fix For: 3.0.0 > > > When `spark.sql.codegen.comments` is enabled, the generated code contains multi-line > comments. However, CodeFormatter currently cannot handle multi-line comments: > > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > /* 005 */ /** > * Codegend pipeline for stage (id=1) > * *(1) Project [(id#0L + 1) AS (id + 1)#3L] > * +- *(1) Filter (id#0L = 1) > * +- *(1) Range (0, 10, step=1, splits=4) > */ > /* 006 */ // codegenStageId=1 > /* 007 */ final class GeneratedIteratorForCodegenStage1 extends > org.apache.spark.sql.execution.BufferedRowIterator { -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27193) CodeFormatter should format multi comment lines correctly
[ https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27193: --- Assignee: wuyi > CodeFormatter should format multi comment lines correctly > - > > Key: SPARK-27193 > URL: https://issues.apache.org/jira/browse/SPARK-27193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: wuyi >Assignee: wuyi >Priority: Trivial > > When `spark.sql.codegen.comments` is enabled, the generated code contains multi-line > comments. However, CodeFormatter currently cannot handle multi-line comments: > > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > /* 005 */ /** > * Codegend pipeline for stage (id=1) > * *(1) Project [(id#0L + 1) AS (id + 1)#3L] > * +- *(1) Filter (id#0L = 1) > * +- *(1) Range (0, 10, step=1, splits=4) > */ > /* 006 */ // codegenStageId=1 > /* 007 */ final class GeneratedIteratorForCodegenStage1 extends > org.apache.spark.sql.execution.BufferedRowIterator {
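The fix itself is in the linked PR, but the idea can be sketched outside Spark. Below is a minimal, hypothetical formatter (not Spark's actual CodeFormatter): it numbers every physical line, including the continuation lines of a multi-line `/** ... */` comment, so comment blocks stay aligned with the `/* NNN */` prefixes instead of escaping the numbering.

```scala
// Hypothetical stand-in for Spark's CodeFormatter: prefix every physical
// line, comment continuations included, with a /* NNN */ marker so that
// multi-line comments no longer break the line numbering.
object CommentAwareFormatter {
  def format(code: String): String =
    code.split("\n", -1).zipWithIndex.map { case (line, i) =>
      f"/* ${i + 1}%03d */ $line"
    }.mkString("\n")
}
```

This is a simplification of what the real formatter does (it also handles indentation), but it shows the invariant the JIRA asks for: every output line, comment or not, carries a number prefix.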
[jira] [Resolved] (SPARK-27162) Add new method getOriginalMap in CaseInsensitiveStringMap
[ https://issues.apache.org/jira/browse/SPARK-27162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27162. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24094 [https://github.com/apache/spark/pull/24094] > Add new method getOriginalMap in CaseInsensitiveStringMap > - > > Key: SPARK-27162 > URL: https://issues.apache.org/jira/browse/SPARK-27162 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently, DataFrameReader/DataFrameWriter support setting Hadoop > configurations via the method `.option()`. > E.g. > ``` > class TestFileFilter extends PathFilter { > override def accept(path: Path): Boolean = path.getParent.getName != "p=2" > } > withTempPath { dir => > val path = dir.getCanonicalPath > val df = spark.range(2) > df.write.orc(path + "/p=1") > df.write.orc(path + "/p=2") > assert(spark.read.orc(path).count() === 4) > val extraOptions = Map( > "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName, > "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName > ) > assert(spark.read.options(extraOptions).orc(path).count() === 2) > } > ``` > While Hadoop configurations are case-sensitive, the current Data Source V2 > APIs use `CaseInsensitiveStringMap` in TableProvider. > To create Hadoop configurations correctly, I suggest adding a method > `getOriginalMap` to `CaseInsensitiveStringMap`.
[jira] [Assigned] (SPARK-27162) Add new method getOriginalMap in CaseInsensitiveStringMap
[ https://issues.apache.org/jira/browse/SPARK-27162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27162: --- Assignee: Gengliang Wang > Add new method getOriginalMap in CaseInsensitiveStringMap > - > > Key: SPARK-27162 > URL: https://issues.apache.org/jira/browse/SPARK-27162 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, DataFrameReader/DataFrameWriter support setting Hadoop > configurations via the method `.option()`. > E.g. > ``` > class TestFileFilter extends PathFilter { > override def accept(path: Path): Boolean = path.getParent.getName != "p=2" > } > withTempPath { dir => > val path = dir.getCanonicalPath > val df = spark.range(2) > df.write.orc(path + "/p=1") > df.write.orc(path + "/p=2") > assert(spark.read.orc(path).count() === 4) > val extraOptions = Map( > "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName, > "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName > ) > assert(spark.read.options(extraOptions).orc(path).count() === 2) > } > ``` > While Hadoop configurations are case-sensitive, the current Data Source V2 > APIs use `CaseInsensitiveStringMap` in TableProvider. > To create Hadoop configurations correctly, I suggest adding a method > `getOriginalMap` to `CaseInsensitiveStringMap`.
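The proposal can be illustrated with a toy map (a hypothetical sketch, not the actual `org.apache.spark.sql.util.CaseInsensitiveStringMap`): lookups are case-insensitive, but `getOriginalMap` returns the entries with the caller's original casing, which is what a case-sensitive Hadoop Configuration needs.

```scala
// Toy sketch of the proposed API: case-insensitive get, plus access to the
// original case-preserving entries for case-sensitive consumers such as
// Hadoop Configuration. If two keys collide case-insensitively, the last
// one wins, as in a plain Map construction.
class CaseInsensitiveMapSketch(original: Map[String, String]) {
  private val lower = original.map { case (k, v) => k.toLowerCase -> v }
  def get(key: String): Option[String] = lower.get(key.toLowerCase)
  def getOriginalMap: Map[String, String] = original
}
```

The key point is that lowercasing for lookup is lossy, so the original map has to be retained separately rather than reconstructed.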
[jira] [Assigned] (SPARK-27198) Heartbeat interval mismatch in driver and executor
[ https://issues.apache.org/jira/browse/SPARK-27198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27198: Assignee: Apache Spark > Heartbeat interval mismatch in driver and executor > -- > > Key: SPARK-27198 > URL: https://issues.apache.org/jira/browse/SPARK-27198 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: Ajith S >Assignee: Apache Spark >Priority: Major > > When the heartbeat interval is configured via *spark.executor.heartbeatInterval* > without specifying units, the value is interpreted inconsistently between the > driver (which treats it as seconds) and the executor (which treats it as milliseconds): > > [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613] > vs > [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858] > >
[jira] [Assigned] (SPARK-27198) Heartbeat interval mismatch in driver and executor
[ https://issues.apache.org/jira/browse/SPARK-27198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27198: Assignee: (was: Apache Spark) > Heartbeat interval mismatch in driver and executor > -- > > Key: SPARK-27198 > URL: https://issues.apache.org/jira/browse/SPARK-27198 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: Ajith S >Priority: Major > > When the heartbeat interval is configured via *spark.executor.heartbeatInterval* > without specifying units, the value is interpreted inconsistently between the > driver (which treats it as seconds) and the executor (which treats it as milliseconds): > > [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613] > vs > [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858] > >
[jira] [Commented] (SPARK-27198) Heartbeat interval mismatch in driver and executor
[ https://issues.apache.org/jira/browse/SPARK-27198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795666#comment-16795666 ] Ajith S commented on SPARK-27198: - will be working on this > Heartbeat interval mismatch in driver and executor > -- > > Key: SPARK-27198 > URL: https://issues.apache.org/jira/browse/SPARK-27198 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: Ajith S >Priority: Major > > When the heartbeat interval is configured via *spark.executor.heartbeatInterval* > without specifying units, the value is interpreted inconsistently between the > driver (which treats it as seconds) and the executor (which treats it as milliseconds): > > [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613] > vs > [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858] > >
[jira] [Created] (SPARK-27198) Heartbeat interval mismatch in driver and executor
Ajith S created SPARK-27198: --- Summary: Heartbeat interval mismatch in driver and executor Key: SPARK-27198 URL: https://issues.apache.org/jira/browse/SPARK-27198 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.3.3 Reporter: Ajith S When the heartbeat interval is configured via *spark.executor.heartbeatInterval* without specifying units, the value is interpreted inconsistently between the driver (which treats it as seconds) and the executor (which treats it as milliseconds): [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613] vs [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858]
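The mismatch described above is easy to reproduce with two hypothetical parsers standing in for the two code paths (these are simplified stand-ins, not the real `SparkConf`/`Executor` parsing): a unit-less value like "10" diverges by a factor of 1000, while an explicit suffix keeps both sides in agreement.

```scala
// Hypothetical stand-ins for the two code paths: the driver-side parser
// assumes a bare number is seconds, the executor-side parser assumes
// milliseconds. Both agree once an explicit unit like "10s" is given.
object HeartbeatUnits {
  def driverMillis(v: String): Long =
    if (v.endsWith("ms")) v.dropRight(2).trim.toLong
    else if (v.endsWith("s")) v.dropRight(1).trim.toLong * 1000L
    else v.trim.toLong * 1000L // bare number: treated as seconds
  def executorMillis(v: String): Long =
    if (v.endsWith("ms")) v.dropRight(2).trim.toLong
    else if (v.endsWith("s")) v.dropRight(1).trim.toLong * 1000L
    else v.trim.toLong // bare number: treated as milliseconds
}
```

The practical workaround until the fix lands is to always configure the value with an explicit suffix, e.g. `spark.executor.heartbeatInterval=10s`.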
[jira] [Resolved] (SPARK-27195) Add AvroReadSchemaSuite
[ https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27195. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24135 > Add AvroReadSchemaSuite > --- > > Key: SPARK-27195 > URL: https://issues.apache.org/jira/browse/SPARK-27195 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > The reader schema is said to be evolved (or projected) when it changes after > the data has been written. Apache Spark's file-based data sources have > test coverage for that. This issue aims to add `AvroReadSchemaSuite` to > ensure minimal consistency among file-based data sources and prevent a > future regression in the Avro data source.
[jira] [Assigned] (SPARK-27195) Add AvroReadSchemaSuite
[ https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27195: - Assignee: Dongjoon Hyun > Add AvroReadSchemaSuite > --- > > Key: SPARK-27195 > URL: https://issues.apache.org/jira/browse/SPARK-27195 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > The reader schema is said to be evolved (or projected) when it changes after > the data has been written. Apache Spark's file-based data sources have > test coverage for that. This issue aims to add `AvroReadSchemaSuite` to > ensure minimal consistency among file-based data sources and prevent a > future regression in the Avro data source.
[jira] [Updated] (SPARK-27195) Add AvroReadSchemaSuite
[ https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27195: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-25603 > Add AvroReadSchemaSuite > --- > > Key: SPARK-27195 > URL: https://issues.apache.org/jira/browse/SPARK-27195 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > The reader schema is said to be evolved (or projected) when it changes after > the data has been written. Apache Spark's file-based data sources have > test coverage for that. This issue aims to add `AvroReadSchemaSuite` to > ensure minimal consistency among file-based data sources and prevent a > future regression in the Avro data source.
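Outside the suite, read-schema evolution can be pictured with plain case classes (a toy model, not the Avro reader's actual API): rows written with schema `(id)` are read back with an evolved schema `(id, name)`, where the reader supplies a default for the column the writer never knew about.

```scala
// Toy model of read-schema evolution: the reader's schema has one more
// column than the writer's, so the reader fills it with a default value.
case class Written(id: Int)
case class Evolved(id: Int, name: String)
object ReadSchema {
  def readWithEvolvedSchema(rows: Seq[Written], default: String): Seq[Evolved] =
    rows.map(r => Evolved(r.id, default))
}
```

A suite like `AvroReadSchemaSuite` asserts that every file-based source handles this kind of reader/writer schema mismatch the same way.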
[jira] [Assigned] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources
[ https://issues.apache.org/jira/browse/SPARK-27197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27197: Assignee: Apache Spark > Add ReadNestedSchemaTest for file-based data sources > > > Key: SPARK-27197 > URL: https://issues.apache.org/jira/browse/SPARK-27197 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > This issue adds test coverage to the schema evolution suite for adding > and hiding nested columns.
[jira] [Assigned] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources
[ https://issues.apache.org/jira/browse/SPARK-27197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27197: - Assignee: Dongjoon Hyun > Add ReadNestedSchemaTest for file-based data sources > > > Key: SPARK-27197 > URL: https://issues.apache.org/jira/browse/SPARK-27197 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This issue adds test coverage to the schema evolution suite for adding > and hiding nested columns.
[jira] [Assigned] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources
[ https://issues.apache.org/jira/browse/SPARK-27197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27197: Assignee: (was: Apache Spark) > Add ReadNestedSchemaTest for file-based data sources > > > Key: SPARK-27197 > URL: https://issues.apache.org/jira/browse/SPARK-27197 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue adds test coverage to the schema evolution suite for adding > and hiding nested columns.
[jira] [Created] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources
Dongjoon Hyun created SPARK-27197: - Summary: Add ReadNestedSchemaTest for file-based data sources Key: SPARK-27197 URL: https://issues.apache.org/jira/browse/SPARK-27197 Project: Spark Issue Type: Sub-task Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue adds test coverage to the schema evolution suite for adding and hiding nested columns.
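The two nested cases can be sketched the same way flat schema evolution is usually pictured (toy types, not the suite's actual helpers): a struct column gains a field ("add nested column"), or the reader's schema drops a field the writer produced ("hide nested column").

```scala
// Toy model of the two nested-schema cases the suite covers: a struct
// column gains a field with a default, or a field is hidden on read.
case class Inner(a: Int)
case class InnerAdded(a: Int, b: Int)
object NestedSchema {
  def addNestedColumn(rows: Seq[Inner], default: Int): Seq[InnerAdded] =
    rows.map(r => InnerAdded(r.a, default))
  def hideNestedColumn(rows: Seq[InnerAdded]): Seq[Inner] =
    rows.map(r => Inner(r.a))
}
```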
[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795575#comment-16795575 ] Ajith S commented on SPARK-27194: - From the logs, it looks like task 200.0 and its re-attempt 200.1 both expect the same file name, part-00200-blah-blah.c000.snappy.parquet (refer to org.apache.spark.internal.io.HadoopMapReduceCommitProtocol#getFilename). Maybe we should include taskId_attemptId in the part file name so that rerun tasks do not conflict with older failed tasks. cc [~srowen] [~cloud_fan] [~dongjoon] any thoughts? > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2 >Reporter: Reza Safi >Priority: Major > > When a container fails for some reason (for example, when killed by YARN for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
> 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client 17.161.235.91 already exists > The job fails when the configured task attempts (spark.task.maxFailures) > have failed with the same error: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent 
failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. >
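The suggestion in the comment above can be sketched as a hypothetical naming helper (the real logic lives in `HadoopMapReduceCommitProtocol#getFilename`; this helper and its signature are illustrative only): embedding the attempt number in the file name means attempt 1 can never collide with a file left behind by failed attempt 0.

```scala
// Hypothetical part-file naming that includes the task attempt number, so
// a re-attempt writes to a different file than a failed earlier attempt.
object PartFileNaming {
  def fileName(partition: Int, attempt: Int, ext: String): String =
    f"part-$partition%05d-attempt-$attempt$ext"
}
```

The trade-off is that abandoned files from failed attempts must still be cleaned up eventually, since they no longer block the re-attempt from succeeding.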
[jira] [Updated] (SPARK-27192) spark.task.cpus should be less than or equal to spark.executor.cores when using static executor allocation
[ https://issues.apache.org/jira/browse/SPARK-27192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijia Liu updated SPARK-27192: -- Description: With dynamic executor allocation, if spark.executor.cores is set smaller than spark.task.cpus, an exception is thrown: '''spark.executor.cores must not be < spark.task.cpus''' But if dynamic executor allocation is not enabled, Spark hangs when a new job is submitted, because TaskSchedulerImpl will not schedule a task on an executor whose available cores are fewer than spark.task.cpus. See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351] So spark.task.cpus should be checked when the task scheduler starts. Reproduce: $SPARK_HOME/bin/spark-shell --conf spark.task.cpus=2 --master local[1] scala> sc.parallelize(1 to 9).collect was: When use dynamic executor allocation, if we set spark.executor.cores small than spark.task.cpus, exception will be thrown as follows: '''spark.executor.cores must not be < spark.task.cpus''' But, if dynamic executor allocation not enabled, spark will hang when submit new job for TaskSchedulerImpl will not schedule a task in a executor which available cores is small than spark.task.cpus.See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351] So, when start task scheduler, spark.task.cpus should be check. 
> spark.task.cpus should be less than or equal to spark.executor.cores when using static > executor allocation > > > Key: SPARK-27192 > URL: https://issues.apache.org/jira/browse/SPARK-27192 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0, 2.4.0 >Reporter: Lijia Liu >Priority: Major > > With dynamic executor allocation, if spark.executor.cores is set smaller > than spark.task.cpus, an exception is thrown: > '''spark.executor.cores must not be < spark.task.cpus''' > But if dynamic executor allocation is not enabled, Spark hangs when a new > job is submitted, because TaskSchedulerImpl will not schedule a task on an > executor whose available cores are fewer than > spark.task.cpus. See > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351] > So spark.task.cpus should be checked when the task scheduler starts. > Reproduce: > $SPARK_HOME/bin/spark-shell --conf spark.task.cpus=2 --master local[1] > scala> sc.parallelize(1 to 9).collect
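The proposed fail-fast check could look like the following (a hypothetical signature, not TaskSchedulerImpl's actual code): reject the impossible configuration at scheduler start-up instead of letting the submitted job hang forever.

```scala
// Hypothetical start-up validation: no executor can ever satisfy a task
// that needs more cores than a single executor has, so fail fast with a
// clear message rather than silently never scheduling the task.
object TaskCpusCheck {
  def validate(executorCores: Int, taskCpus: Int): Unit =
    require(taskCpus <= executorCores,
      s"spark.task.cpus ($taskCpus) must not be > spark.executor.cores ($executorCores)")
}
```

With the reproduction above (`--master local[1] --conf spark.task.cpus=2`), such a check would abort at start-up instead of hanging on `collect`.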
[jira] [Updated] (SPARK-27188) FileStreamSink: provide a new option to have retention on output files
[ https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-27188: - Summary: FileStreamSink: provide a new option to have retention on output files (was: FileStreamSink: provide a new option to disable metadata log) > FileStreamSink: provide a new option to have retention on output files > -- > > Key: SPARK-27188 > URL: https://issues.apache.org/jira/browse/SPARK-27188 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > In SPARK-24295 we saw that various end users are struggling with huge > FileStreamSink metadata logs. Unfortunately, given that arbitrary > readers leverage the metadata log to determine which files can be safely read > (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement. > While we may be able to check for deleted output files in > FileStreamSink and get rid of them when compacting metadata, that operation > would add overhead to the running query. (I'll try to address this > via another issue though.) > Back to the issue: 'exactly-once' via metadata is only possible > when the output directory is read by Spark, and in other cases it provides > weaker guarantees. I think we could provide this option as a workaround to > mitigate the issue.
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795519#comment-16795519 ] Jungtaek Lim commented on SPARK-24295: -- [~iqbal_khattra] [~alfredo-gimenez-bv] I would really appreciate it if you could review SPARK-27188 and see whether it works for your cases. Thanks in advance! > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated into a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness while reading the data from the FileStreamSinkLog dir, as Spark > defaults to the "_spark_metadata" dir for the read. > We need a way to purge the compact file data. >
[jira] [Updated] (SPARK-27188) FileStreamSink: provide a new option to have retention on output files
[ https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-27188: - Description: In SPARK-24295 we saw that various end users are struggling with huge FileStreamSink metadata logs. Unfortunately, given that arbitrary readers leverage the metadata log to determine which files can be safely read (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement. While we may be able to check for deleted output files in FileStreamSink and get rid of them when compacting metadata, that operation would add overhead to the running query. (I'll try to address this via another issue though.) We can still get a time-to-live (TTL) for output files from end users, and filter out files in the metadata so that the metadata does not grow linearly. Filtered-out files will also no longer be seen by reader queries which leverage File(Stream)Source. was: >From SPARK-24295 we indicated various end users are struggling with dealing >with huge FileStreamSink metadata log. Unfortunately, given we have arbitrary >readers which leverage metadata log to determine which files are safely read >(to ensure 'exactly-once'), pruning metadata log is not trivial to implement. While we may be able to deal with checking deleted output files in FileStreamSink and get rid of them when compacting metadata, that operation would take additional overhead for running query. (I'll try to address this via another issue though.) Back to the issue, 'exactly-once' via leveraging metadata is only possible when output directory is being read by Spark, and for other cases it should provide less guarantee. I think we could provide this as a workaround to mitigate such issue. 
> FileStreamSink: provide a new option to have retention on output files > -- > > Key: SPARK-27188 > URL: https://issues.apache.org/jira/browse/SPARK-27188 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > In SPARK-24295 we saw that various end users are struggling with huge > FileStreamSink metadata logs. Unfortunately, given that arbitrary > readers leverage the metadata log to determine which files can be safely read > (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement. > While we may be able to check for deleted output files in > FileStreamSink and get rid of them when compacting metadata, that operation > would add overhead to the running query. (I'll try to address this > via another issue though.) > We can still get a time-to-live (TTL) for output files from end users, and > filter out files in the metadata so that the metadata does not grow linearly. > Filtered-out files will also no longer be seen by reader queries which leverage > File(Stream)Source.
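The retention idea can be sketched with toy metadata entries (hypothetical types, not the sink's actual log format): at compaction time, entries older than the user-supplied TTL are dropped, so the compacted metadata stops growing without bound.

```scala
// Toy sketch of TTL-based pruning of sink metadata: keep only entries
// whose modification time is within the retention window.
case class SinkEntry(path: String, modTimeMs: Long)
object SinkRetention {
  def prune(entries: Seq[SinkEntry], nowMs: Long, ttlMs: Long): Seq[SinkEntry] =
    entries.filter(e => nowMs - e.modTimeMs <= ttlMs)
}
```

Note the caveat from the description: a reader that still expects the pruned files loses the 'exactly-once' view, which is why the TTL has to come from the end user rather than being applied automatically.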
[jira] [Commented] (SPARK-27178) k8s test failing due to missing nss library in dockerfile
[ https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795494#comment-16795494 ] shane knapp commented on SPARK-27178: - [~vanzin] just merged https://github.com/apache/spark/pull/24137 marking as resolved... we should be g2g for now. i'll create a new jira to discuss the potential pinning of these dockerfiles to a specific image version. > k8s test failing due to missing nss library in dockerfile > - > > Key: SPARK-27178 > URL: https://issues.apache.org/jira/browse/SPARK-27178 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Fix For: 2.4.2, 3.0.0 > > > while performing some tests on our existing minikube and k8s infrastructure, > i noticed that the integration tests were failing. i dug in and discovered > the following message buried at the end of the stacktrace: > {noformat} > Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so > at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) > at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218) > ... 81 more > {noformat} > after i added the 'nss' package to > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, > everything worked. > i will also check and see if this is failing on 2.4... > tbh, i have no idea why this literally started failing today and not earlier. > the only recent change to this file that i can find is > https://issues.apache.org/jira/browse/SPARK-26995
[jira] [Comment Edited] (SPARK-27178) k8s test failing due to missing nss library in dockerfile
[ https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795494#comment-16795494 ] shane knapp edited comment on SPARK-27178 at 3/18/19 11:54 PM: --- [~vanzin] just merged https://github.com/apache/spark/pull/24137 we should be g2g for now. i'll create a new jira to discuss the potential pinning of these dockerfiles to a specific image version. was (Author: shaneknapp): [~vanzin] just merged https://github.com/apache/spark/pull/24137 marking as resolved... we should be g2g for now. i'll create a new jira to discuss the potential pinning of these dockerfiles to a specific image version. > k8s test failing due to missing nss library in dockerfile > - > > Key: SPARK-27178 > URL: https://issues.apache.org/jira/browse/SPARK-27178 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Fix For: 2.4.2, 3.0.0 > > > while performing some tests on our existing minikube and k8s infrastructure, > i noticed that the integration tests were failing. i dug in and discovered > the following message buried at the end of the stacktrace: > {noformat} > Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so > at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) > at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218) > ... 81 more > {noformat} > after i added the 'nss' package to > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, > everything worked. > i will also check and see if this is failing on 2.4... > tbh, i have no idea why this literally started failing today and not earlier. 
> the only recent change to this file that i can find is > https://issues.apache.org/jira/browse/SPARK-26995
[jira] [Resolved] (SPARK-27178) k8s test failing due to missing nss library in dockerfile
[ https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27178. Resolution: Fixed Fix Version/s: 3.0.0 2.4.2 > k8s test failing due to missing nss library in dockerfile > - > > Key: SPARK-27178 > URL: https://issues.apache.org/jira/browse/SPARK-27178 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Fix For: 2.4.2, 3.0.0 > > > while performing some tests on our existing minikube and k8s infrastructure, > i noticed that the integration tests were failing. i dug in and discovered > the following message buried at the end of the stacktrace: > {noformat} > Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so > at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) > at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218) > ... 81 more > {noformat} > after i added the 'nss' package to > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, > everything worked. > i will also check and see if this is failing on 2.4... > tbh, i have no idea why this literally started failing today and not earlier. > the only recent change to this file that i can find is > https://issues.apache.org/jira/browse/SPARK-26995
[jira] [Commented] (SPARK-27178) k8s test failing due to missing nss library in dockerfile
[ https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795487#comment-16795487 ] shane knapp commented on SPARK-27178: - https://github.com/apache/spark/pull/24111 merged to master. holding off on the 2.4 fix. > k8s test failing due to missing nss library in dockerfile > - > > Key: SPARK-27178 > URL: https://issues.apache.org/jira/browse/SPARK-27178 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > while performing some tests on our existing minikube and k8s infrastructure, > i noticed that the integration tests were failing. i dug in and discovered > the following message buried at the end of the stacktrace: > {noformat} > Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so > at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) > at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218) > ... 81 more > {noformat} > after i added the 'nss' package to > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, > everything worked. > i will also check and see if this is failing on 2.4... > tbh, i have no idea why this literally started failing today and not earlier. > the only recent change to this file that i can find is > https://issues.apache.org/jira/browse/SPARK-26995
[jira] [Commented] (SPARK-20787) PySpark can't handle datetimes before 1900
[ https://issues.apache.org/jira/browse/SPARK-20787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795477#comment-16795477 ] AdrianC commented on SPARK-20787: - Hi, We are seeing this issue in pyspark 2.2.1/ python 2.7 while trying to port a legacy system to pyspark (AWS Glue). I'm trying to follow the pull request comments, but am I to understand that this issue is not being fixed right now? > PySpark can't handle datetimes before 1900 > -- > > Key: SPARK-20787 > URL: https://issues.apache.org/jira/browse/SPARK-20787 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.1.1 >Reporter: Keith Bourgoin >Priority: Major > > When trying to put a datetime before 1900 into a DataFrame, it throws an > error because of the use of time.mktime. > {code} > Python 2.7.13 (default, Mar 8 2017, 17:29:55) > Type "copyright", "credits" or "license" for more information. > IPython 5.3.0 -- An enhanced Interactive Python. > ? -> Introduction and overview of IPython's features. > %quickref -> Quick reference. > help -> Python's own help system. > object? -> Details about 'object', use 'object??' for extra details. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/05/17 12:45:59 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/05/17 12:46:02 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Mar 8 2017 17:29:55) > SparkSession available as 'spark'. 
> In [1]: import datetime as dt > In [2]: > sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count() > 17/05/17 12:46:16 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 174, in main > process() > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 169, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 268, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in toInternal > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 436, in toInternal > return self.dataType.toInternal(obj) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 191, in toInternal > else time.mktime(dt.timetuple())) > ValueError: year out of range > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) > at > org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > 
at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.
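The root cause in the traceback above is `time.mktime(dt.timetuple())`, which on many platforms cannot represent years before 1900 (or before the epoch) and raises `ValueError`. A workaround that sidesteps `mktime` entirely is plain `timedelta` arithmetic against the epoch, which handles pre-1900 dates fine. This is a sketch of the idea only, not the actual change made to `pyspark/sql/types.py`:

```python
import datetime as dt

EPOCH = dt.datetime(1970, 1, 1)

def to_internal_us(d):
    """Convert a naive datetime to epoch microseconds without
    time.mktime, so pre-1900 (negative) timestamps still work.
    Integer arithmetic avoids float rounding from total_seconds()."""
    delta = d - EPOCH
    return (delta.days * 86_400_000_000
            + delta.seconds * 1_000_000
            + delta.microseconds)

# 1899-12-31 is 25568 days before the epoch, i.e. a negative timestamp.
micros = to_internal_us(dt.datetime(1899, 12, 31))
```

Note this treats the datetime as UTC; the real conversion also has to account for the session time zone, which is where much of the complexity in the linked pull request discussion comes from.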
[jira] [Created] (SPARK-27196) Beginning offset 115204574 is after the ending offset 115204516 for topic
Prasanna Talakanti created SPARK-27196: -- Summary: Beginning offset 115204574 is after the ending offset 115204516 for topic Key: SPARK-27196 URL: https://issues.apache.org/jira/browse/SPARK-27196 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Environment: Spark: 2.3.0 Spark Kafka: spark-streaming-kafka-0-10_2.3.0 Kafka Client: org.apache.kafka.kafka-clients: 0.11.0.1 Reporter: Prasanna Talakanti We are getting this issue in production and the Spark consumer is dying because of an offset issue. We observed the following error in the Kafka broker -- [2019-03-18 14:40:14,100] WARN Unable to reconnect to ZooKeeper service, session 0x1692e9ff4410004 has expired (org.apache.zookeeper.ClientCnxn) [2019-03-18 14:40:14,100] INFO Unable to reconnect to ZooKeeper service, session 0x1692e9ff4410004 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) --- The Spark job died with the following error: ERROR 2019-03-18 07:40:57,178 7924 org.apache.spark.executor.Executor [Executor task launch worker for task 16] Exception in task 27.0 in stage 0.0 (TID 16) java.lang.AssertionError: assertion failed: Beginning offset 115204574 is after the ending offset 115204516 for topic partition 37. 
You either provided an invalid fromOffset, or the Kafka topic has been damaged at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:175) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
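The assertion that kills the job is the per-partition sanity check that `fromOffset <= untilOffset`; it trips when a stored offset points past what the broker currently serves (for example after the ZooKeeper session expiry above led to log truncation or topic recreation). One defensive pattern on the application side is to clamp stored offsets into the broker's current range before building the batch. The helper below is hypothetical, not part of the Spark or Kafka APIs; the offset values mirror the log above:

```python
def clamp_offset_range(stored_from, earliest, latest):
    """Clamp a consumer's stored starting offset into [earliest, latest]
    as reported by the broker, instead of asserting and killing the job.
    Returns (from_offset, until_offset) with from <= until guaranteed."""
    start = min(max(stored_from, earliest), latest)
    return start, latest

# Stored offset 115204574 is past the broker's latest 115204516,
# so we fall back to the latest offset (an empty batch) rather than fail.
rng = clamp_offset_range(115204574, earliest=0, latest=115204516)
```

The trade-off is that clamping silently skips data that was lost on the broker side, so it belongs behind an explicit opt-in rather than as default behaviour.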
[jira] [Commented] (SPARK-27178) k8s test failing due to missing nss library in dockerfile
[ https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795414#comment-16795414 ] Apache Spark commented on SPARK-27178: -- User 'shaneknapp' has created a pull request for this issue: https://github.com/apache/spark/pull/24137 > k8s test failing due to missing nss library in dockerfile > - > > Key: SPARK-27178 > URL: https://issues.apache.org/jira/browse/SPARK-27178 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > while performing some tests on our existing minikube and k8s infrastructure, > i noticed that the integration tests were failing. i dug in and discovered > the following message buried at the end of the stacktrace: > {noformat} > Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so > at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) > at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218) > ... 81 more > {noformat} > after i added the 'nss' package to > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, > everything worked. > i will also check and see if this is failing on 2.4... > tbh, i have no idea why this literally started failing today and not earlier. > the only recent change to this file that i can find is > https://issues.apache.org/jira/browse/SPARK-26995
[jira] [Commented] (SPARK-26844) Parquet Reader exception - ArrayIndexOutOfBound should give more information to user
[ https://issues.apache.org/jira/browse/SPARK-26844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795382#comment-16795382 ] nirav patel commented on SPARK-26844: - [~hyukjin.kwon] Yes, the Parquet file is corrupt. It has a newline character in a field value somewhere. Is it possible for Spark to provide more information when it fails to read such a file? > Parquet Reader exception - ArrayIndexOutOfBound should give more information > to user > > > Key: SPARK-26844 > URL: https://issues.apache.org/jira/browse/SPARK-26844 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: nirav patel >Priority: Minor > > I get following error while reading parquet file which has primitive > datatypes (INT32, binary) > > > spark.read.format("parquet").load(path).show() // error happens here > > Caused by: java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:203) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > > > Point if 
an ArrayIndexOutOfBoundsException is raised on a column/field, Spark > should say which particular column/field it is; that helps in troubleshooting. > > e.g. I get the following error while reading the same file using the Drill reader: > org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: > Error reading page data File: /.../../part-00016-0-m-00016.parquet > *Column: GROUP_NAME* Row Group Start: 5539 Fragment 0:0 > I also get more specific information in Drillbit.log
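The improvement requested above (naming the offending column, as Drill does) can be illustrated with a generic wrapper that catches a low-level decoding error and re-raises it enriched with the column being read. This is purely a sketch: `read_column` and the column names are hypothetical, not Spark's vectorized Parquet reader:

```python
def read_column(name, raw):
    # Hypothetical low-level decoder that can blow up on corrupt bytes.
    return [v.decode("utf-8") for v in raw]

def read_row_group(columns):
    """Read each column, wrapping any failure with the column name so
    the user sees *which* field was corrupt, rather than a bare
    ArrayIndexOutOfBounds-style error with no context."""
    out = {}
    for name, raw in columns.items():
        try:
            out[name] = read_column(name, raw)
        except Exception as e:
            raise RuntimeError(f"Error reading column '{name}': {e}") from e
    return out
```

Chaining with `from e` preserves the original stack trace, so the detailed cause is still available while the top-level message becomes actionable.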
[jira] [Updated] (SPARK-26844) Parquet Reader exception - ArrayIndexOutOfBound should give more information to user
[ https://issues.apache.org/jira/browse/SPARK-26844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nirav patel updated SPARK-26844: Description: I get following error while reading parquet file which has primitive datatypes (INT32, binary) Parquet file is potentially corrupt. It has newline character in some field value. spark.read.format("parquet").load(path).show() // error happens here Caused by: java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163) at org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410) at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:203) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) Point if ArrayIndexOutOfBoundsException raised on a column/field spark should say what particular column/field it is. it helps in troubleshoot. e.g. I get following error while reading same file using Drill reader. 
org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Error reading page data File: /.../../part-00016-0-m-00016.parquet *Column: GROUP_NAME* Row Group Start: 5539 Fragment 0:0 I also get more specific information in Drillbit.log was: I get following error while reading parquet file which has primitive datatypes (INT32, binary) spark.read.format("parquet").load(path).show() // error happens here Caused by: java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163) at org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410) at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.r
[jira] [Assigned] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27088: Assignee: Apache Spark > Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in > RuleExecutor > - > > Key: SPARK-27088 > URL: https://issues.apache.org/jira/browse/SPARK-27088 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maryann Xue >Assignee: Apache Spark >Priority: Minor > > Similar to SPARK-25415, which has made log level for plan changes by each > rule configurable, we can make log level for plan changes by each batch > configurable too and can reuse the same configuration: > "spark.sql.optimizer.planChangeLog.level".
[jira] [Assigned] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27088: Assignee: (was: Apache Spark) > Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in > RuleExecutor > - > > Key: SPARK-27088 > URL: https://issues.apache.org/jira/browse/SPARK-27088 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maryann Xue >Priority: Minor > > Similar to SPARK-25415, which has made log level for plan changes by each > rule configurable, we can make log level for plan changes by each batch > configurable too and can reuse the same configuration: > "spark.sql.optimizer.planChangeLog.level".
[jira] [Assigned] (SPARK-27195) Add AvroReadSchemaSuite
[ https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27195: Assignee: (was: Apache Spark) > Add AvroReadSchemaSuite > --- > > Key: SPARK-27195 > URL: https://issues.apache.org/jira/browse/SPARK-27195 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > The reader schema is said to be evolved (or projected) when it changed after > the data is written by writers. Apache Spark file-based data sources have a > test coverage for that. This issue aims to add `AvroReadSchemaSuite` to > ensure the minimal consistency among file-based data sources and prevent a > future regression in Avro data source.
[jira] [Assigned] (SPARK-27195) Add AvroReadSchemaSuite
[ https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27195: Assignee: Apache Spark > Add AvroReadSchemaSuite > --- > > Key: SPARK-27195 > URL: https://issues.apache.org/jira/browse/SPARK-27195 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > The reader schema is said to be evolved (or projected) when it changed after > the data is written by writers. Apache Spark file-based data sources have a > test coverage for that. This issue aims to add `AvroReadSchemaSuite` to > ensure the minimal consistency among file-based data sources and prevent a > future regression in Avro data source.
[jira] [Created] (SPARK-27195) Add AvroReadSchemaSuite
Dongjoon Hyun created SPARK-27195: - Summary: Add AvroReadSchemaSuite Key: SPARK-27195 URL: https://issues.apache.org/jira/browse/SPARK-27195 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun The reader schema is said to be evolved (or projected) when it changed after the data is written by writers. Apache Spark file-based data sources have a test coverage for that. This issue aims to add `AvroReadSchemaSuite` to ensure the minimal consistency among file-based data sources and prevent a future regression in Avro data source.
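The kind of read-schema evolution such a suite covers can be shown with a tiny projection helper: records written under an older schema are read under a newer one, with added columns filled with null and dropped columns ignored. A sketch in plain Python of the concept being tested, not the Avro or Spark implementation:

```python
def project(record, read_schema):
    """Project a written record onto an evolved read schema:
    fields missing from the record become None (added columns),
    fields absent from the read schema are dropped."""
    return {field: record.get(field) for field in read_schema}

written = {"id": 1, "name": "a"}                   # writer schema: id, name
evolved = project(written, ["id", "name", "age"])  # reader added 'age'
```

A read-schema suite asserts this behaviour uniformly across formats (Parquet, ORC, Avro, ...) so that one data source cannot silently diverge from the others.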
[jira] [Updated] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Safi updated SPARK-27194: -- Description: When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails. Here is what transpires in the logs: The container where task 200.0 is running is killed and the task is lost: 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. The task is re-attempted on a different executor and fails because the part-00200-blah-blah.c000.snappy.parquet file from the first task attempt already exists: 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client 17.161.235.91 already exists The job fails when the the configured task attempts (spark.task.maxFailures) have failed with the same error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) ... Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client i.p.a.d already exists SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt. This issue seems to happen when spark.sql.sources.partitionOverwriteMode=dynamic. 
was: When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails. Here is what transpires in the logs: The container where task 200.0 is running is killed and the task is lost: 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. The task is re-attempted on
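The retry behavior driving this report (a task is re-attempted until spark.task.maxFailures is reached, after which the whole job aborts) can be sketched in plain Python. This is a hypothetical illustration of the policy, not Spark's actual scheduler code:

```python
def run_with_retries(task, max_failures=4):
    # Mimics Spark's per-task retry policy: each failed attempt increments a
    # counter, and once one task accumulates max_failures failures (Spark's
    # spark.task.maxFailures, default 4) the job is aborted.
    failures = 0
    while True:
        try:
            return task(attempt=failures)
        except Exception as exc:
            failures += 1
            if failures >= max_failures:
                raise RuntimeError(
                    f"Task failed {failures} times, most recent failure: {exc}"
                ) from exc
```

Under this policy, a retry that always hits the same FileAlreadyExistsException (because the leftover staging file is never cleaned up) exhausts all attempts and aborts the job, exactly as in the logs above.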
[jira] [Updated] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Safi updated SPARK-27194: -- Description: When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails. Here is what transpires in the logs: The container where task 200.0 is running is killed and the task is lost: 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. The task is re-attempted on a different executor and fails because the part-00200-blah-blah.c000.snappy.parquet file from the first task attempt already exists: 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client 17.161.235.91 already exists The job fails when the the configured task attempts (spark.task.maxFailures) have failed with the same error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) ... Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client i.p.a.d already exists SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt. This seems that happens when spark.sql.sources.partitionOverwriteMode=dynamic. 
was: When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails. Here is what transpires in the logs: The container where task 200.0 is running is killed and the task is lost: 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. The task is re-attempted on a d
[jira] [Updated] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Safi updated SPARK-27194: -- Description: When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails. Here is what transpires in the logs: The container where task 200.0 is running is killed and the task is lost: 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. The task is re-attempted on a different executor and fails because the part-00200-blah-blah.c000.snappy.parquet file from the first task attempt already exists: 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client 17.161.235.91 already exists The job fails when the the configured task attempts (spark.task.maxFailures) have failed with the same error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) ... Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client i.p.a.d already exists SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt. This seems that happens when dynamicPartitionOverwrite=dynamic. 
was: When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails. Here is what transpires in the logs: The container where task 200.0 is running is killed and the task is lost: 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. The task is re-attempted on a different executor
[jira] [Created] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
Reza Safi created SPARK-27194: - Summary: Job failures when task attempts do not clean up spark-staging parquet files Key: SPARK-27194 URL: https://issues.apache.org/jira/browse/SPARK-27194 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.3.2, 2.3.1 Reporter: Reza Safi When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails. Here is what transpires in the logs: The container where task 200.0 is running is killed and the task is lost: 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. The task is re-attempted on a different executor and fails because the part-00200-blah-blah.c000.snappy.parquet file from the first task attempt already exists: 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client 17.161.235.91 already exists The job fails when the the configured task attempts (spark.task.maxFailures) have failed with the same error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) ... Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client i.p.a.d already exists SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt. This seems that happens when dynamicPartitionOverwrite=true. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
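The failure mode itself is easy to reproduce outside Spark: any exclusive-create write fails if a prior attempt's partial output was never cleaned up. The sketch below uses Python's pathlib as a hypothetical stand-in for HDFS create(path, overwrite=false); it is not Spark's commit protocol, only a model of why a best-effort abort that deletes the partial file lets a retry reuse the same name:

```python
from pathlib import Path

def write_attempt(staging: Path) -> None:
    # exist_ok=False mirrors HDFS create(path, overwrite=false): if a previous
    # attempt left this file behind, creation fails just like the
    # FileAlreadyExistsException in the logs above.
    staging.touch(exist_ok=False)
    staging.write_text("row data")

def abort_task(staging: Path) -> None:
    # A best-effort abort deletes the partial file so a later task attempt
    # can write to the same staging file name.
    staging.unlink(missing_ok=True)
```

When abort_task is skipped (as appears to happen when the container is killed), every subsequent write_attempt against the same staging path fails.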
[jira] [Resolved] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mrinal Kanti Sardar resolved SPARK-27191. - Resolution: Not A Bug Fix Version/s: 2.3.0 Explained > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > Fix For: 2.3.0 > > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. > {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795170#comment-16795170 ] Mrinal Kanti Sardar commented on SPARK-27191: - Absolutely. I, somehow, missed `unionByName`. Thanks for the explanation. Will close this issue. > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. > {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795154#comment-16795154 ] Liang-Chi Hsieh commented on SPARK-27191: - Thanks for pinging me and giving the answer [~yumwang][~dkbiswal]. As [~dkbiswal] said, {{union}} resolves columns by positions, so the behavior in the description is expected. I think the document of {{union}} explains it now. > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. > {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795166#comment-16795166 ] Dilip Biswal commented on SPARK-27191: -- [~viirya] Thank you very much. > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. > {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795150#comment-16795150 ] Dilip Biswal commented on SPARK-27191: -- Hello [~mrinal10449], The Jira you have referred to [link-22335|https://issues.apache.org/jira/browse/SPARK-22335 ], actually hasn't resulted in a code change. As a fix, [~viirya] has improved the documentation of the union API by clarifying that union api resolves the columns by their positions and not by name. Here is the link to the [PR|https://github.com/apache/spark/pull/19570/files]. The recommended method for your use case is to use 'unionByName'. > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. > {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795150#comment-16795150 ] Dilip Biswal edited comment on SPARK-27191 at 3/18/19 4:09 PM: --- Hello [~mrinal10449], The Jira you have referred to [link-22335|https://issues.apache.org/jira/browse/SPARK-22335 ], actually hasn't resulted in a code change. As a fix, [~viirya] has improved the documentation of the union API by clarifying that union api resolves the columns by their positions and not by name. Here is the link to the [PR|https://github.com/apache/spark/pull/19570/files]. The recommended method for your use case is to use 'unionByName' instead. was (Author: dkbiswal): Hello [~mrinal10449], The Jira you have referred to [link-22335|https://issues.apache.org/jira/browse/SPARK-22335 ], actually hasn't resulted in a code change. As a fix, [~viirya] has improved the documentation of the union API by clarifying that union api resolves the columns by their positions and not by name. Here is the link to the [PR|https://github.com/apache/spark/pull/19570/files]. The recommended method for your use case is to use 'unionByName'. > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. 
> {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
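The positional semantics explained in the comments above can be shown without Spark at all. Below is a minimal pure-Python sketch, with rows as dicts and schemas as ordered column lists; the helper names are hypothetical and only model what DataFrame.union and DataFrame.unionByName do:

```python
def union_by_position(schema_a, rows_a, schema_b, rows_b):
    # Like DataFrame.union: column i of the second frame is appended under
    # the i-th column name of the first frame, regardless of its own name.
    remapped = [dict(zip(schema_a, (row[c] for c in schema_b))) for row in rows_b]
    return rows_a + remapped

def union_by_name(schema_a, rows_a, rows_b):
    # Like DataFrame.unionByName: columns are matched by name instead.
    return rows_a + [{c: row[c] for c in schema_a} for row in rows_b]
```

With the data from the report, union_by_position puts "2bbb" under col1 (the surprising output in the ticket), while union_by_name keeps "2aa" under col1.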
[jira] [Updated] (SPARK-27193) CodeFormatter should format multi comment lines correctly
[ https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-27193: Priority: Trivial (was: Major) > CodeFormatter should format multi comment lines correctly > - > > Key: SPARK-27193 > URL: https://issues.apache.org/jira/browse/SPARK-27193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: wuyi >Priority: Trivial > > when enable `spark.sql.codegen.comments`, there will be multiple comment > lines. However, CodeFormatter can not handle multi comment lines currently: > > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > /* 005 */ /** > \* Codegend pipeline for stage (id=1) > \* *(1) Project [(id#0L + 1) AS (id + 1)#3L] > \* +- *(1) Filter (id#0L = 1) > \*+- *(1) Range (0, 10, step=1, splits=4) > \*/ > /* 006 */ // codegenStageId=1 > /* 007 */ final class GeneratedIteratorForCodegenStage1 extends > org.apache.spark.sql.execution.BufferedRowIterator { -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-27112. -- Resolution: Fixed Assignee: Parth Gandhi Fix Version/s: 3.0.0 Fixed by https://github.com/apache/spark/pull/24072 > Spark Scheduler encounters two independent Deadlocks when trying to kill > executors either due to dynamic allocation or blacklisting > > > Key: SPARK-27112 > URL: https://issues.apache.org/jira/browse/SPARK-27112 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot > 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, > Screen Shot 2019-02-26 at 4.11.26 PM.png > > > Recently, a few spark users in the organization have reported that their jobs > were getting stuck. On further analysis, it was found out that there exist > two independent deadlocks and either of them occur under different > circumstances. The screenshots for these two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code: > > {code:java} > import org.apache.hadoop.conf.Configuration > import org.apache.hadoop.fs.{FileSystem, Path} > import org.apache.spark._ > import org.apache.spark.TaskContext > // Simple example of Word Count in Scala > object ScalaWordCount { > def main(args: Array[String]) { > if (args.length < 2) { > System.err.println("Usage: ScalaWordCount ") > System.exit(1) > } > val conf = new SparkConf().setAppName("Scala Word Count") > val sc = new SparkContext(conf) > // get the input file uri > val inputFilesUri = args(0) > // get the output file uri > val outputFilesUri = args(1) > while (true) { > val textFile = sc.textFile(inputFilesUri) > val counts = textFile.flatMap(line => line.split(" ")) > .map(word => {if (TaskContext.get.partitionId == 5 && > TaskContext.get.attemptNumber == 0) throw new Exception("Fail for > blacklisting") else (word, 1)}) > .reduceByKey(_ + _) > counts.saveAsTextFile(outputFilesUri) > val conf: Configuration = new Configuration() > val path: Path = new Path(outputFilesUri) > val hdfs: FileSystem = FileSystem.get(conf) > hdfs.delete(path, true) > } > sc.stop() > } > } > {code} > > Additionally, to ensure that the deadlock surfaces up soon enough, I also > added a small delay in the Spark code here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256] > > {code:java} > executorIdToFailureList.remove(exec) > updateNextExpiryTime() > Thread.sleep(2000) > killBlacklistedExecutor(exec) > {code} > > Also make sure that the following configs are set when launching the above > spark job: > *spark.blacklist.enabled=true* > *spark.blacklist.killBlacklistedExecutors=true* > *spark.blacklist.application.maxFailedTasksPerExecutor=1* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
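The class of bug fixed above, two threads acquiring the same pair of locks in opposite orders while executors are being killed, can be illustrated with a generic lock-ordering sketch. This is a hypothetical Python demo of the avoidance technique (a single global lock order), not the actual Spark scheduler code:

```python
import threading

# Stand-ins for the two monitors involved in the reported deadlocks
# (e.g. a scheduler lock and a backend lock; names are illustrative).
lock_a = threading.Lock()
lock_b = threading.Lock()

def safe_kill(name, results):
    # Every thread acquires the locks in the same global order (a before b),
    # so no thread can hold one lock while waiting forever for the other.
    with lock_a:
        with lock_b:
            results.append(name)

def demo():
    results = []
    threads = [threading.Thread(target=safe_kill, args=(f"t{i}", results))
               for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

If one of the two code paths instead took lock_b first, the interleaving described in the attached thread dumps becomes possible; enforcing one acquisition order is the standard remedy.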
[jira] [Updated] (SPARK-27193) CodeFormatter should format multi comment lines correctly
[ https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-27193: - Description: when enable `spark.sql.codegen.comments`, there will be multiple comment lines. However, CodeFormatter can not handle multi comment lines currently: Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ /** \* Codegend pipeline for stage (id=1) \* *(1) Project [(id#0L + 1) AS (id + 1)#3L] \* +- *(1) Filter (id#0L = 1) \*+- *(1) Range (0, 10, step=1, splits=4) \*/ /* 006 */ // codegenStageId=1 /* 007 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { was: when enable `spark.sql.codegen.comments`, there will be multiple comment lines. However, CodeFormatter can not handle multi comment lines currently: Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ /** * Codegend pipeline for stage (id=1) * *(1) Project [(id#0L + 1) AS (id + 1)#3L] * +- *(1) Filter (id#0L = 1) *+- *(1) Range (0, 10, step=1, splits=4) */ /* 006 */ // codegenStageId=1 /* 007 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { > CodeFormatter should format multi comment lines correctly > - > > Key: SPARK-27193 > URL: https://issues.apache.org/jira/browse/SPARK-27193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: wuyi >Priority: Major > > when enable `spark.sql.codegen.comments`, there will be multiple comment > lines. 
However, CodeFormatter can not handle multi comment lines currently: > > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > /* 005 */ /** > \* Codegend pipeline for stage (id=1) > \* *(1) Project [(id#0L + 1) AS (id + 1)#3L] > \* +- *(1) Filter (id#0L = 1) > \*+- *(1) Range (0, 10, step=1, splits=4) > \*/ > /* 006 */ // codegenStageId=1 > /* 007 */ final class GeneratedIteratorForCodegenStage1 extends > org.apache.spark.sql.execution.BufferedRowIterator { -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
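A sketch of what correct handling looks like: a line-numbering formatter must prefix every physical line, including the continuation lines of a /** ... */ block, instead of emitting them unprefixed as in the broken output above. This is a hypothetical Python re-implementation of the numbering step, not Spark's actual CodeFormatter:

```python
def number_lines(code: str) -> str:
    # Prefix every physical line with a /* NNN */ marker, in the style of
    # Spark's generated-code dumps. Continuation lines of a multi-line
    # comment are numbered too, so the pipeline header stays aligned.
    out = []
    for i, line in enumerate(code.split("\n"), start=1):
        out.append(f"/* {i:03d} */ {line}")
    return "\n".join(out)
```

Applied to a snippet containing a multi-line /** */ comment, this yields /* 001 */, /* 002 */, ... through the comment body rather than the bare continuation lines shown in the issue description.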
[jira] [Assigned] (SPARK-27193) CodeFormatter should format multi comment lines correctly
[ https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27193: Assignee: (was: Apache Spark) > CodeFormatter should format multi comment lines correctly > - > > Key: SPARK-27193 > URL: https://issues.apache.org/jira/browse/SPARK-27193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: wuyi >Priority: Major > > When `spark.sql.codegen.comments` is enabled, there will be multiple comment > lines. However, CodeFormatter currently cannot handle multi-line comments: > > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > /* 005 */ /** > * Codegend pipeline for stage (id=1) > * *(1) Project [(id#0L + 1) AS (id + 1)#3L] > * +- *(1) Filter (id#0L = 1) > *+- *(1) Range (0, 10, step=1, splits=4) > */ > /* 006 */ // codegenStageId=1 > /* 007 */ final class GeneratedIteratorForCodegenStage1 extends > org.apache.spark.sql.execution.BufferedRowIterator { -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27193) CodeFormatter should format multi comment lines correctly
[ https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27193: Assignee: Apache Spark > CodeFormatter should format multi comment lines correctly > - > > Key: SPARK-27193 > URL: https://issues.apache.org/jira/browse/SPARK-27193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > When `spark.sql.codegen.comments` is enabled, there will be multiple comment > lines. However, CodeFormatter currently cannot handle multi-line comments: > > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > /* 005 */ /** > * Codegend pipeline for stage (id=1) > * *(1) Project [(id#0L + 1) AS (id + 1)#3L] > * +- *(1) Filter (id#0L = 1) > *+- *(1) Range (0, 10, step=1, splits=4) > */ > /* 006 */ // codegenStageId=1 > /* 007 */ final class GeneratedIteratorForCodegenStage1 extends > org.apache.spark.sql.execution.BufferedRowIterator { -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27193) CodeFormatter should format multi comment lines correctly
wuyi created SPARK-27193: Summary: CodeFormatter should format multi comment lines correctly Key: SPARK-27193 URL: https://issues.apache.org/jira/browse/SPARK-27193 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: wuyi When `spark.sql.codegen.comments` is enabled, there will be multiple comment lines. However, CodeFormatter currently cannot handle multi-line comments: Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ /** * Codegend pipeline for stage (id=1) * *(1) Project [(id#0L + 1) AS (id + 1)#3L] * +- *(1) Filter (id#0L = 1) *+- *(1) Range (0, 10, step=1, splits=4) */ /* 006 */ // codegenStageId=1 /* 007 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27189) Add Executor level memory usage metrics to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27189: Assignee: (was: Apache Spark) > Add Executor level memory usage metrics to the metrics system > - > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27189) Add Executor level memory usage metrics to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27189: Assignee: Apache Spark > Add Executor level memory usage metrics to the metrics system > - > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Apache Spark >Priority: Minor > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27189) Add Executor level memory usage metrics to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-27189: Summary: Add Executor level memory usage metrics to the metrics system (was: Add Executor level metrics to the metrics system) > Add Executor level memory usage metrics to the metrics system > - > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27192) spark.task.cpus should be less than or equal to spark.executor.cores when static executor allocation is used
[ https://issues.apache.org/jira/browse/SPARK-27192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27192: Assignee: (was: Apache Spark) > spark.task.cpus should be less than or equal to spark.executor.cores when > static executor allocation is used > > > Key: SPARK-27192 > URL: https://issues.apache.org/jira/browse/SPARK-27192 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0, 2.4.0 >Reporter: Lijia Liu >Priority: Major > > When dynamic executor allocation is used, if spark.executor.cores is set > smaller than spark.task.cpus, an exception is thrown: > '''spark.executor.cores must not be < spark.task.cpus''' > But if dynamic executor allocation is not enabled, Spark will hang when a new > job is submitted, because TaskSchedulerImpl will not schedule a task on an > executor whose available cores are fewer than spark.task.cpus. See > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351] > So spark.task.cpus should be checked when the task scheduler starts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27192) spark.task.cpus should be less than or equal to spark.executor.cores when static executor allocation is used
[ https://issues.apache.org/jira/browse/SPARK-27192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27192: Assignee: Apache Spark > spark.task.cpus should be less than or equal to spark.executor.cores when > static executor allocation is used > > > Key: SPARK-27192 > URL: https://issues.apache.org/jira/browse/SPARK-27192 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0, 2.4.0 >Reporter: Lijia Liu >Assignee: Apache Spark >Priority: Major > > When dynamic executor allocation is used, if spark.executor.cores is set > smaller than spark.task.cpus, an exception is thrown: > '''spark.executor.cores must not be < spark.task.cpus''' > But if dynamic executor allocation is not enabled, Spark will hang when a new > job is submitted, because TaskSchedulerImpl will not schedule a task on an > executor whose available cores are fewer than spark.task.cpus. See > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351] > So spark.task.cpus should be checked when the task scheduler starts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27192) spark.task.cpus should be less than or equal to spark.executor.cores when static executor allocation is used
Lijia Liu created SPARK-27192: - Summary: spark.task.cpus should be less than or equal to spark.executor.cores when static executor allocation is used Key: SPARK-27192 URL: https://issues.apache.org/jira/browse/SPARK-27192 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.3.0, 2.2.0 Reporter: Lijia Liu When dynamic executor allocation is used, if spark.executor.cores is set smaller than spark.task.cpus, an exception is thrown: '''spark.executor.cores must not be < spark.task.cpus''' But if dynamic executor allocation is not enabled, Spark will hang when a new job is submitted, because TaskSchedulerImpl will not schedule a task on an executor whose available cores are fewer than spark.task.cpus. See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351] So spark.task.cpus should be checked when the task scheduler starts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
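The proposed change boils down to a fail-fast validation at scheduler start-up: reject the configuration up front rather than letting job submission stall forever. A hypothetical model of that check in Python (the real change would be in Scala, in TaskSchedulerImpl or the config validation path):

```python
def validate_task_cpus(executor_cores: int, task_cpus: int) -> None:
    # If no executor can ever offer enough cores for a single task, the
    # scheduler will never place that task, so under static allocation
    # every job would hang. Fail fast instead.
    if executor_cores < task_cpus:
        raise ValueError(
            f"spark.executor.cores ({executor_cores}) must not be less "
            f"than spark.task.cpus ({task_cpus})"
        )
```

This mirrors the error message the dynamic-allocation path already produces, extending the same guarantee to static allocation.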
[jira] [Updated] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mrinal Kanti Sardar updated SPARK-27191: Summary: union of dataframes depends on order of the columns in 2.4.0 (was: union of dataframes depends on order of the columns) > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. > {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795018#comment-16795018 ] Yuming Wang commented on SPARK-27191: - cc [~viirya] > union of dataframes depends on order of the columns in 2.4.0 > > > Key: SPARK-27191 > URL: https://issues.apache.org/jira/browse/SPARK-27191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mrinal Kanti Sardar >Priority: Major > > Thought this issue was resolved in 2.3.0 according to > https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in > 2.4.0. > {code:java} > >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) > >>> df_1.show() > +++ > |col1| col2| > +++ > | 1aa|1bbb| > +++ > >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) > >>> df_2.show() > +++ > | col2|col1| > +++ > |2bbb| 2aa| > +++ > >>> df_u = df_1.union(df_2) > >>> df_u.show() > +++ > | col1| col2| > +++ > | 1aa|1bbb| > |2bbb| 2aa| > +++ > >>> spark.version > '2.4.0' > >>> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27191) union of dataframes depends on order of the columns
Mrinal Kanti Sardar created SPARK-27191: --- Summary: union of dataframes depends on order of the columns Key: SPARK-27191 URL: https://issues.apache.org/jira/browse/SPARK-27191 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Mrinal Kanti Sardar Thought this issue was resolved in 2.3.0 according to https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 2.4.0. {code:java} >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"]) >>> df_1.show() +++ |col1| col2| +++ | 1aa|1bbb| +++ >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"]) >>> df_2.show() +++ | col2|col1| +++ |2bbb| 2aa| +++ >>> df_u = df_1.union(df_2) >>> df_u.show() +++ | col1| col2| +++ | 1aa|1bbb| |2bbb| 2aa| +++ >>> spark.version '2.4.0' >>> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
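For context, position-based resolution is the documented behaviour of `union`; SPARK-21043 (Spark 2.3+) added `DataFrame.unionByName`, which resolves columns by name and is the usual workaround for the report above. The alignment a name-based union performs can be modelled in plain Python:

```python
def union_by_name(rows1, cols1, rows2, cols2):
    # Reorder the second relation's columns to match the first, then
    # concatenate. Plain positional union would skip the reordering and
    # simply stack rows, mixing up values as seen in the bug report.
    if set(cols1) != set(cols2):
        raise ValueError("column sets differ")
    idx = [cols2.index(c) for c in cols1]
    return rows1 + [tuple(r[i] for i in idx) for r in rows2], cols1
```

With the report's data, the second row comes out as `("2aa", "2bbb")`, i.e. each value lands in the column matching its name.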
[jira] [Assigned] (SPARK-27190) Add DataSourceV2 capabilities for streaming
[ https://issues.apache.org/jira/browse/SPARK-27190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27190: Assignee: Wenchen Fan (was: Apache Spark) > Add DataSourceV2 capabilities for streaming > --- > > Key: SPARK-27190 > URL: https://issues.apache.org/jira/browse/SPARK-27190 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27089) Loss of precision during decimal division
[ https://issues.apache.org/jira/browse/SPARK-27089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794984#comment-16794984 ] Marco Gaido commented on SPARK-27089: - You can set {{spark.sql.decimalOperations.allowPrecisionLoss}} to {{false}} and you will get the original behavior. Please see SPARK-22036 for more details. > Loss of precision during decimal division > - > > Key: SPARK-27089 > URL: https://issues.apache.org/jira/browse/SPARK-27089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: ylo0ztlmtusq >Priority: Major > > Spark loses decimal places when dividing decimal numbers. > > Expected behavior (in Spark 2.2.3 or before) > > {code:java} > scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val""" > sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val > scala> spark.sql(sql).show > 19/03/07 21:23:51 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > ++ > | val| > ++ > |0.33| > ++ > {code} > > Current behavior (in Spark 2.3.2 and later) > > {code:java} > scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val""" > sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val > scala> spark.sql(sql).show > ++ > | val| > ++ > |0.33| > ++ > {code} > > Seems to be caused by {{promote_precision(38, 6)}} > > {code:java} > scala> spark.sql(sql).explain(true) > == Parsed Logical Plan == > Project [cast((cast(3 as decimal(38,14)) / cast(9 as decimal(38,14))) as > decimal(38,14)) AS val#20] > +- OneRowRelation > == Analyzed Logical Plan == > val: decimal(38,14) > Project [cast(CheckOverflow((promote_precision(cast(cast(3 as decimal(38,14)) > as decimal(38,14))) / promote_precision(cast(cast(9 as decimal(38,14)) as > decimal(38,14)))), DecimalType(38,6)) as decimal(38,14)) AS val#20] > +- OneRowRelation > == Optimized Logical Plan == > Project [0.33 AS val#20] > +- OneRowRelation > == Physical Plan == > *(1) Project [0.33 AS val#20] > +- Scan OneRowRelation[] > {code} > > Source https://stackoverflow.com/q/55046492 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
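The arithmetic behind the complaint can be reproduced with Python's decimal module: the intermediate result type chosen in the analyzed plan has scale 6, so quantizing the exact quotient to 6 fractional digits models the 2.3+ behaviour, while scale 14 models the pre-2.3 result. This is a model of the rounding only, not of Spark's code:

```python
from decimal import Decimal, getcontext

getcontext().prec = 38  # mirror Spark's maximum decimal precision

exact = Decimal(3) / Decimal(9)  # 0.333... at full context precision
scale6 = exact.quantize(Decimal("1.000000"))           # models DecimalType(38,6)
scale14 = exact.quantize(Decimal("1.00000000000000"))  # models DecimalType(38,14)
```

Casting `scale6` back to `decimal(38,14)` cannot recover the digits already rounded away, which is exactly what the outer `cast(... as decimal(38,14))` in the query runs into.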
[jira] [Updated] (SPARK-27189) Add Executor level metrics to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-27189: Attachment: Example_dashboard_Spark_Memory_Metrics.PNG > Add Executor level metrics to the metrics system > > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27190) Add DataSourceV2 capabilities for streaming
[ https://issues.apache.org/jira/browse/SPARK-27190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27190: Assignee: Apache Spark (was: Wenchen Fan) > Add DataSourceV2 capabilities for streaming > --- > > Key: SPARK-27190 > URL: https://issues.apache.org/jira/browse/SPARK-27190 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27190) Add DataSourceV2 capabilities for streaming
Wenchen Fan created SPARK-27190: --- Summary: Add DataSourceV2 capabilities for streaming Key: SPARK-27190 URL: https://issues.apache.org/jira/browse/SPARK-27190 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27189) Add Executor level metrics to the metrics system
Luca Canali created SPARK-27189: --- Summary: Add Executor level metrics to the metrics system Key: SPARK-27189 URL: https://issues.apache.org/jira/browse/SPARK-27189 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Luca Canali Attachments: Example_dashboard_Spark_Memory_Metrics.PNG This proposes to add instrumentation of memory usage via the Spark Dropwizard/Codahale metrics system. Memory usage metrics are available via the Executor metrics, recently implemented as detailed in https://issues.apache.org/jira/browse/SPARK-23206. Making memory usage metrics available via the Spark Dropwizard metrics system allows improving Spark performance dashboards and studying memory usage, as in the attached example graph. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
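Once published through Dropwizard, the new executor memory metrics would flow to whatever sinks are configured in Spark's metrics.properties. A typical sketch of such a configuration, assuming a Graphite endpoint (host and port are placeholders):

```properties
# Send metrics from all instances (driver, executors, ...) to Graphite.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
```

The resulting time series is what dashboards like the attached example graph would be built on.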
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794962#comment-16794962 ] Xiaoju Wu commented on SPARK-21492: --- Any updates? Has there been any discussion of a general fix instead of the hack in SMJ? > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0 >Reporter: Zhan Zhang >Priority: Major > > In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak > caused by the Sort. The memory is not released until the task ends, and cannot > be used by other operators, causing a performance drop or OOM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object
[ https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hehuiyuan updated SPARK-27184: -- Description: In org.apache.spark.internal.config, we define the variables FILES and JARS; we can use them instead of the literals "spark.jars" and "spark.files". private[spark] val JARS = ConfigBuilder("spark.jars") .stringConf .toSequence .createWithDefault(Nil) private[spark] val FILES = ConfigBuilder("spark.files") .stringConf .toSequence .createWithDefault(Nil) was: In org.apache.spark.internal.object, we define the variables FILES and JARS; we can use them instead of the literals "spark.jars" and "spark.files". private[spark] val JARS = ConfigBuilder("spark.jars") .stringConf .toSequence .createWithDefault(Nil) private[spark] val FILES = ConfigBuilder("spark.files") .stringConf .toSequence .createWithDefault(Nil) > Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in > config object > > > Key: SPARK-27184 > URL: https://issues.apache.org/jira/browse/SPARK-27184 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: hehuiyuan >Priority: Minor > > In org.apache.spark.internal.config, we define the variables FILES and > JARS; we can use them instead of the literals "spark.jars" and "spark.files". > private[spark] val JARS = ConfigBuilder("spark.jars") > .stringConf > .toSequence > .createWithDefault(Nil) > private[spark] val FILES = ConfigBuilder("spark.files") > .stringConf > .toSequence > .createWithDefault(Nil) > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27063. --- Resolution: Won't Fix > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Priority: Minor > > As noted during development for SPARK-26729, there are a couple of integration > test timeouts that are too short when running on slower clusters, e.g. > developers' laptops, small CI clusters etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the default timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25837) Web UI does not respect spark.ui.retainedJobs in some instances
[ https://issues.apache.org/jira/browse/SPARK-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794932#comment-16794932 ] Xiaoju Wu commented on SPARK-25837: --- Did you verify this fix with the reproduction case above? I tried and found the issue is still there: the cleanup was still backed up, but better than the version without this fix. > Web UI does not respect spark.ui.retainedJobs in some instances > --- > > Key: SPARK-25837 > URL: https://issues.apache.org/jira/browse/SPARK-25837 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 > Environment: Reproduction Environment: > Spark 2.3.1 > Dataproc 1.3-deb9 > 1x master 4 vCPUs, 15 GB > 2x workers 4 vCPUs, 15 GB > >Reporter: Patrick Brown >Assignee: Patrick Brown >Priority: Minor > Fix For: 2.3.3, 2.4.1, 3.0.0 > > Attachments: Screen Shot 2018-10-23 at 4.40.51 PM (1).png > > > Expected Behavior: Web UI only displays 1 completed job and remains > responsive. > Actual Behavior: Both during job execution and for a considerable time after > all jobs complete, the UI retains many completed jobs, limiting responsiveness. > > To reproduce: > > > spark-shell --conf spark.ui.retainedJobs=1 > > scala> import scala.concurrent._ > scala> import scala.concurrent.ExecutionContext.Implicits.global > scala> for (i <- 0 until 5) { Future > { println(sc.parallelize(0 until i).collect.length) } > } > > > > The attached screenshot shows the state of the web UI after running the repro > code; you can see the UI is displaying some 43k completed jobs (which takes a > long time to load). After a few minutes of inactivity this will clear out; > however, in an application which continues to submit jobs every once in a > while, the issue persists. > > The issue seems to appear when running multiple jobs at once, as well as in > sequence for a while, and may also have something to do with high master > CPU usage (thus the collect in the repro code). 
My rough guess would be > whatever is managing clearing out completed jobs gets overwhelmed (on the > master during repro htop reported almost full CPU usage across all 4 cores). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.
[ https://issues.apache.org/jira/browse/SPARK-26811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26811: --- Assignee: Ryan Blue > Add DataSourceV2 capabilities to check support for batch append, overwrite, > truncate during analysis. > - > > Key: SPARK-26811 > URL: https://issues.apache.org/jira/browse/SPARK-26811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.
[ https://issues.apache.org/jira/browse/SPARK-26811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26811. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24012 [https://github.com/apache/spark/pull/24012] > Add DataSourceV2 capabilities to check support for batch append, overwrite, > truncate during analysis. > - > > Key: SPARK-26811 > URL: https://issues.apache.org/jira/browse/SPARK-26811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27186) Optimize SortShuffleWriter writing process
[ https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27186: Fix Version/s: (was: 3.0.0) > Optimize SortShuffleWriter writing process > -- > > Key: SPARK-27186 > URL: https://issues.apache.org/jira/browse/SPARK-27186 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: wangjiaochun >Priority: Minor > > If the records passed to SortShuffleWriter.write are empty, it should return > directly instead of continuing; see how BypassMergeSortShuffleWriter.write > handles this case. This avoids the cost of creating an ExternalSorter instance > and a temporary file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27188) FileStreamSink: provide a new option to disable metadata log
[ https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27188: Assignee: Apache Spark > FileStreamSink: provide a new option to disable metadata log > > > Key: SPARK-27188 > URL: https://issues.apache.org/jira/browse/SPARK-27188 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > In SPARK-24295 we noted that various end users are struggling with huge > FileStreamSink metadata logs. Unfortunately, given that arbitrary readers > leverage the metadata log to determine which files can be safely read > (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement. > While we may be able to check for deleted output files in > FileStreamSink and drop them when compacting metadata, that operation > would add overhead to the running query. (I'll try to address this > via another issue.) > Back to this issue: 'exactly-once' via the metadata log is only possible > when the output directory is read by Spark; other readers get a weaker > guarantee anyway. I think we could provide this option as a workaround to > mitigate the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27188) FileStreamSink: provide a new option to disable metadata log
[ https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27188: Assignee: (was: Apache Spark) > FileStreamSink: provide a new option to disable metadata log > > > Key: SPARK-27188 > URL: https://issues.apache.org/jira/browse/SPARK-27188 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > In SPARK-24295 we noted that various end users are struggling to deal > with a huge FileStreamSink metadata log. Unfortunately, given that arbitrary > readers leverage the metadata log to determine which files are safe to read > (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement. > While we may be able to check for deleted output files in > FileStreamSink and get rid of them when compacting metadata, that operation > would add overhead to the running query. (I'll try to address this > via another issue though.) > Back to this issue: 'exactly-once' via the metadata log is only possible > when the output directory is read by Spark, and in other cases it > provides a weaker guarantee. I think we could provide this option as a workaround to > mitigate the issue.
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794904#comment-16794904 ] Gabor Somogyi commented on SPARK-27124: --- [~hyukjin.kwon] thanks for your time in the discussion. > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schemas between Spark SQL and Avro. This is reachable from the Scala side > but not from PySpark. I suggest adding this as a developer API to ease > development for PySpark users.
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description: Hi Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is completed, however I'm not really sure of it. Appreciate if someone can point in the right direction on where I could check for this files. http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG was: Hi Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is completed, however I'm not really sure of it. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. 
http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG > What spark jar files serves the following files .. > -- > > Key: SPARK-27187 > URL: https://issues.apache.org/jira/browse/SPARK-27187 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 >Reporter: Jerry Garcia >Priority: Minor > > Hi Everyone, > Is there any way I could determine what spark service or component serves > the following files please refer below and where can I locate this files? I'm > assuming it could be part of jar file that is being executed whenever a spark > job is completed, however I'm not really sure of it. Appreciate if someone > can point in the right direction on where I could check for this files. > http:///history.7z > http:///history.bak > http:///history.bz2 > http:///history.cfg > http:///history.csv > http:///history.dump > http:///history.gz > http:///history.ini > http:///history.jar > http:///history.old > http:///history.ost > http:///history.pst > http:///history.sh > http:///history.sln > http:///history.sql > http:///history.sql.bz2 > http:///history.sql.gz > http:///history.tar > http:///history.tar.bz2 > http:///history.tar.gz > http:///history.war > http:///history.zip > This files was tag as possible sensitive files and possibly can be be > exploited, as a precautionary measure can we restrict or remove this file > from the website. 
> Your help is highly appreciated. > > Best Regards, > JG >
[jira] [Created] (SPARK-27188) FileStreamSink: provide a new option to disable metadata log
Jungtaek Lim created SPARK-27188: Summary: FileStreamSink: provide a new option to disable metadata log Key: SPARK-27188 URL: https://issues.apache.org/jira/browse/SPARK-27188 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim In SPARK-24295 we noted that various end users are struggling to deal with a huge FileStreamSink metadata log. Unfortunately, given that arbitrary readers leverage the metadata log to determine which files are safe to read (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement. While we may be able to check for deleted output files in FileStreamSink and get rid of them when compacting metadata, that operation would add overhead to the running query. (I'll try to address this via another issue though.) Back to this issue: 'exactly-once' via the metadata log is only possible when the output directory is read by Spark, and in other cases it provides a weaker guarantee. I think we could provide this option as a workaround to mitigate the issue.
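To see why disabling the metadata log weakens the guarantee, it helps to sketch the mechanism. The toy Python below is a simplified stand-in (not Spark's FileStreamSinkLog implementation; the class and field names are made up) showing how a sink-side log lets Spark readers ignore files left behind by failed or retried batches, which is what gives 'exactly-once' for Spark readers and why arbitrary readers of the bare directory do not get it:

```python
# Simplified model: the writer records committed files per batch id, and the
# commit is idempotent so a retried batch cannot commit a second set of files.
# A Spark reader trusts the log, not the directory listing.

class FileSinkMetadataLog:
    def __init__(self):
        self._batches = {}  # batch_id -> list of committed file names

    def add(self, batch_id, files):
        """Commit files for a batch; re-attempts of the same batch are no-ops."""
        if batch_id in self._batches:
            return False
        self._batches[batch_id] = list(files)
        return True

    def committed_files(self):
        return [f for files in self._batches.values() for f in files]

log = FileSinkMetadataLog()
log.add(0, ["part-0000"])
log.add(1, ["part-0001"])
log.add(1, ["part-0001-retry"])  # retried batch 1: ignored by the log

# A non-Spark reader sees everything in the directory, including leftovers;
# a Spark reader filters the listing through the log.
directory_listing = ["part-0000", "part-0001", "part-0001-retry"]
safe_to_read = [f for f in directory_listing if f in set(log.committed_files())]
```

With the log disabled, every reader is in the "directory listing" position, so the option trades the exactly-once guarantee for a smaller maintenance burden.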
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is running, but i could also be wrong. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG was: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? Previous findings shows that they are part of spark history jar files, but i couldn't exactly pinpoint on what is the exact jar file that this file is coming from. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. 
http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG > What spark jar files serves the following files .. > -- > > Key: SPARK-27187 > URL: https://issues.apache.org/jira/browse/SPARK-27187 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 >Reporter: Jerry Garcia >Priority: Minor > > Hello Everyone, > Is there any way I could determine what spark service or component serves > the following files please refer below and where can I locate this files? I'm > assuming it could be part of jar file that is being executed whenever a spark > job is running, but i could also be wrong. And is it safe to move or delete > this files ? Appreciate if someone can point in the right direction on where > I could check for this files. 
> http:///history.7z > http:///history.bak > http:///history.bz2 > http:///history.cfg > http:///history.csv > http:///history.dump > http:///history.gz > http:///history.ini > http:///history.jar > http:///history.old > http:///history.ost > http:///history.pst > http:///history.sh > http:///history.sln > http:///history.sql > http:///history.sql.bz2 > http:///history.sql.gz > http:///history.tar > http:///history.tar.bz2 > http:///history.tar.gz > http:///history.war > http:///history.zip > This files was tag as possible sensitive files and possibly can be be > exploited, as a precautionary measure can we restrict or remove this file > from the website. > Your help is highly appreciated. > > Best Regards, > JG >
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description: Hi Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is completed, however I'm not really sure of it. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG was: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is completed, however I'm not really sure of it. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. 
http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG > What spark jar files serves the following files .. > -- > > Key: SPARK-27187 > URL: https://issues.apache.org/jira/browse/SPARK-27187 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 >Reporter: Jerry Garcia >Priority: Minor > > Hi Everyone, > Is there any way I could determine what spark service or component serves > the following files please refer below and where can I locate this files? I'm > assuming it could be part of jar file that is being executed whenever a spark > job is completed, however I'm not really sure of it. And is it safe to move > or delete this files ? Appreciate if someone can point in the right direction > on where I could check for this files. 
> http:///history.7z > http:///history.bak > http:///history.bz2 > http:///history.cfg > http:///history.csv > http:///history.dump > http:///history.gz > http:///history.ini > http:///history.jar > http:///history.old > http:///history.ost > http:///history.pst > http:///history.sh > http:///history.sln > http:///history.sql > http:///history.sql.bz2 > http:///history.sql.gz > http:///history.tar > http:///history.tar.bz2 > http:///history.tar.gz > http:///history.war > http:///history.zip > This files was tag as possible sensitive files and possibly can be be > exploited, as a precautionary measure can we restrict or remove this file > from the website. > Your help is highly appreciated. > > Best Regards, > JG >
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is completed, however I'm not really sure of it. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG was: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is completed, but i could also be wrong. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. 
http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG > What spark jar files serves the following files .. > -- > > Key: SPARK-27187 > URL: https://issues.apache.org/jira/browse/SPARK-27187 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 >Reporter: Jerry Garcia >Priority: Minor > > Hello Everyone, > Is there any way I could determine what spark service or component serves > the following files please refer below and where can I locate this files? I'm > assuming it could be part of jar file that is being executed whenever a spark > job is completed, however I'm not really sure of it. And is it safe to move > or delete this files ? Appreciate if someone can point in the right direction > on where I could check for this files. 
> http:///history.7z > http:///history.bak > http:///history.bz2 > http:///history.cfg > http:///history.csv > http:///history.dump > http:///history.gz > http:///history.ini > http:///history.jar > http:///history.old > http:///history.ost > http:///history.pst > http:///history.sh > http:///history.sln > http:///history.sql > http:///history.sql.bz2 > http:///history.sql.gz > http:///history.tar > http:///history.tar.bz2 > http:///history.tar.gz > http:///history.war > http:///history.zip > This files was tag as possible sensitive files and possibly can be be > exploited, as a precautionary measure can we restrict or remove this file > from the website. > Your help is highly appreciated. > > Best Regards, > JG >
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is completed, but i could also be wrong. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG was: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? I'm assuming it could be part of jar file that is being executed whenever a spark job is running, but i could also be wrong. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. 
http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG > What spark jar files serves the following files .. > -- > > Key: SPARK-27187 > URL: https://issues.apache.org/jira/browse/SPARK-27187 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 >Reporter: Jerry Garcia >Priority: Minor > > Hello Everyone, > Is there any way I could determine what spark service or component serves > the following files please refer below and where can I locate this files? I'm > assuming it could be part of jar file that is being executed whenever a spark > job is completed, but i could also be wrong. And is it safe to move or delete > this files ? Appreciate if someone can point in the right direction on where > I could check for this files. 
> http:///history.7z > http:///history.bak > http:///history.bz2 > http:///history.cfg > http:///history.csv > http:///history.dump > http:///history.gz > http:///history.ini > http:///history.jar > http:///history.old > http:///history.ost > http:///history.pst > http:///history.sh > http:///history.sln > http:///history.sql > http:///history.sql.bz2 > http:///history.sql.gz > http:///history.tar > http:///history.tar.bz2 > http:///history.tar.gz > http:///history.war > http:///history.zip > This files was tag as possible sensitive files and possibly can be be > exploited, as a precautionary measure can we restrict or remove this file > from the website. > Your help is highly appreciated. > > Best Regards, > JG >
[jira] [Commented] (SPARK-27186) Optimize SortShuffleWriter writing process
[ https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794881#comment-16794881 ] Yuming Wang commented on SPARK-27186: - Please avoid setting {{Fix Version/s}}; it is usually reserved for committers. > Optimize SortShuffleWriter writing process > -- > > Key: SPARK-27186 > URL: https://issues.apache.org/jira/browse/SPARK-27186 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: wangjiaochun >Priority: Minor > Fix For: 3.0.0 > > > If the records iterator passed to SortShuffleWriter.write is empty, the method > should return directly rather than keep running; refer to how > BypassMergeSortShuffleWriter.write handles this case. Returning early avoids the cost of > creating an ExternalSorter instance and a temporary file.
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794883#comment-16794883 ] Hyukjin Kwon commented on SPARK-27124: -- Yea, thanks [~gsomogyi]. I'll keep an eye on the mailing list to see whether we need to discuss this API later! > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schemas between Spark SQL and Avro. This is reachable from the Scala side > but not from PySpark. I suggest adding this as a developer API to ease > development for PySpark users.
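For context on what SPARK-27124 asks to expose: the kind of conversion SchemaConverters performs can be illustrated with a toy mapping. The snippet below is illustrative only, a plain-Python stand-in rather than the spark-avro API (in Scala one would call SchemaConverters.toSqlType on an Avro Schema); the dictionary and helper function here are hypothetical:

```python
# Toy Avro-primitive -> Spark SQL type-name mapping, sketched for illustration.
# Real SchemaConverters also handles records, arrays, maps, unions, and
# logical types, which this stand-in deliberately omits.

AVRO_TO_SQL = {
    "boolean": "BooleanType",
    "int": "IntegerType",
    "long": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "string": "StringType",
    "bytes": "BinaryType",
}

def avro_record_to_sql_fields(record_schema):
    """Map an Avro record schema (as a dict) to (name, sql_type) pairs."""
    return [(f["name"], AVRO_TO_SQL[f["type"]]) for f in record_schema["fields"]]

schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}],
}
fields = avro_record_to_sql_fields(schema)
```

Exposing the real converter as a developer API would let PySpark users obtain such a StructType directly instead of reimplementing a mapping like this by hand.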
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? Previous findings shows that they are part of spark history jar files, but i couldn't exactly pinpoint on what is the exact jar file that this file is coming from. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip This files was tag as possible sensitive files and possibly can be be exploited, as a precautionary measure can we restrict or remove this file from the website. Your help is highly appreciated. Best Regards, JG was: Hello Everyone, Is there any way I could determine what spark service or component serves the following files please refer below and where can I locate this files? Previous findings shows that they are part of spark history jar files, but i couldn't exactly pinpoint on what is the exact jar file that this file is coming from. And is it safe to move or delete this files ? Appreciate if someone can point in the right direction on where I could check for this files. 
http:///history.7z http:///history.bak http:///history.bz2 http:///history.cfg http:///history.csv http:///history.dump http:///history.gz http:///history.ini http:///history.jar http:///history.old http:///history.ost http:///history.pst http:///history.sh http:///history.sln http:///history.sql http:///history.sql.bz2 http:///history.sql.gz http:///history.tar http:///history.tar.bz2 http:///history.tar.gz http:///history.war http:///history.zip Your help is highly appreciated. Best Regards, > What spark jar files serves the following files .. > -- > > Key: SPARK-27187 > URL: https://issues.apache.org/jira/browse/SPARK-27187 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 >Reporter: Jerry Garcia >Priority: Minor > > Hello Everyone, > Is there any way I could determine what spark service or component serves > the following files please refer below and where can I locate this files? > Previous findings shows that they are part of spark history jar files, but i > couldn't exactly pinpoint on what is the exact jar file that this file is > coming from. And is it safe to move or delete this files ? Appreciate if > someone can point in the right direction on where I could check for this > files. > http:///history.7z > http:///history.bak > http:///history.bz2 > http:///history.cfg > http:///history.csv > http:///history.dump > http:///history.gz > http:///history.ini > http:///history.jar > http:///history.old > http:///history.ost > http:///history.pst > http:///history.sh > http:///history.sln > http:///history.sql > http:///history.sql.bz2 > http:///history.sql.gz > http:///history.tar > http:///history.tar.bz2 > http:///history.tar.gz > http:///history.war > http:///history.zip > This files was tag as possible sensitive files and possibly can be be > exploited, as a precautionary measure can we restrict or remove this file > from the website. > Your help is highly appreciated. 
> > Best Regards, > JG >
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Issue Type: Bug (was: Question) > What spark jar files serves the following files .. > -- > > Key: SPARK-27187 > URL: https://issues.apache.org/jira/browse/SPARK-27187 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 >Reporter: Jerry Garcia >Priority: Minor > > Hello Everyone, > Is there any way I could determine what spark service or component serves > the following files please refer below and where can I locate this files? > Previous findings shows that they are part of spark history jar files, but i > couldn't exactly pinpoint on what is the exact jar file that this file is > coming from. And is it safe to move or delete this files ? Appreciate if > someone can point in the right direction on where I could check for this > files. > http:///history.7z > http:///history.bak > http:///history.bz2 > http:///history.cfg > http:///history.csv > http:///history.dump > http:///history.gz > http:///history.ini > http:///history.jar > http:///history.old > http:///history.ost > http:///history.pst > http:///history.sh > http:///history.sln > http:///history.sql > http:///history.sql.bz2 > http:///history.sql.gz > http:///history.tar > http:///history.tar.bz2 > http:///history.tar.gz > http:///history.war > http:///history.zip > Your help is highly appreciated. > Best Regards, >
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description updated to additionally ask to be pointed in the right direction on where to check for these files.
[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..
[ https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Garcia updated SPARK-27187: - Description updated.
[jira] [Assigned] (SPARK-27186) Optimize SortShuffleWriter writing process
[ https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27186: Assignee: Apache Spark > Optimize SortShuffleWriter writing process > -- > > Key: SPARK-27186 > URL: https://issues.apache.org/jira/browse/SPARK-27186 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 3.0.0 > Reporter: wangjiaochun > Assignee: Apache Spark > Priority: Minor > Fix For: 3.0.0 > > > If the records passed to SortShuffleWriter.write are empty, it should return directly instead of continuing; refer to the corresponding handling in BypassMergeSortShuffleWriter.write. This avoids the cost of creating an ExternalSorter instance and a temporary file.
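The proposal above is essentially an early-return guard. A minimal sketch of that guard, with hypothetical names that stand in for Spark's actual internals (the boolean field stands in for constructing an ExternalSorter and its temporary spill file):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

// Simplified, hypothetical sketch of the proposed guard in
// SortShuffleWriter.write: when the record iterator is empty, return before
// creating the sorter or any temporary spill file, mirroring what
// BypassMergeSortShuffleWriter does for the empty case.
class SimplifiedSortShuffleWriter {
    boolean sorterCreated = false; // stand-in for "new ExternalSorter(...)"
    int recordsWritten = 0;

    void write(Iterator<String> records) {
        if (!records.hasNext()) {
            return; // nothing to sort or spill; skip all allocation
        }
        sorterCreated = true;
        while (records.hasNext()) {
            records.next(); // the real code inserts into the sorter here
            recordsWritten++;
        }
    }
}
```

With an empty iterator the writer allocates nothing; with a non-empty one it behaves as before.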
[jira] [Created] (SPARK-27187) What spark jar files serves the following files ..
Jerry Garcia created SPARK-27187: Summary: What spark jar files serves the following files .. Key: SPARK-27187 URL: https://issues.apache.org/jira/browse/SPARK-27187 Project: Spark Issue Type: Question Components: Spark Submit Affects Versions: 1.6.2 Reporter: Jerry Garcia
[jira] [Assigned] (SPARK-27186) Optimize SortShuffleWriter writing process
[ https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27186: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-27186) Optimize SortShuffleWriter writing process
wangjiaochun created SPARK-27186: Summary: Optimize SortShuffleWriter writing process Key: SPARK-27186 URL: https://issues.apache.org/jira/browse/SPARK-27186 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.0.0 Reporter: wangjiaochun Fix For: 3.0.0
[jira] [Assigned] (SPARK-26961) Found Java-level deadlock in Spark Driver
[ https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26961: Assignee: Apache Spark > Found Java-level deadlock in Spark Driver > - > > Key: SPARK-26961 > URL: https://issues.apache.org/jira/browse/SPARK-26961 > Project: Spark > Issue Type: Bug > Components: Spark Submit > Affects Versions: 2.3.0 > Reporter: Rong Jialei > Assignee: Apache Spark > Priority: Major > Attachments: image-2019-03-13-19-53-52-390.png > > > Our Spark job usually finishes in minutes; however, we recently found it taking days to run, and we could only kill it when this happened. > An investigation showed that no worker container could connect to the driver after start and the driver was hanging; using jstack, we found a Java-level deadlock. > > *Jstack output for the deadlock is shown below:* > > Found one Java-level deadlock: > = > "SparkUI-907": > waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a > org.apache.hadoop.conf.Configuration), > which is held by "ForkJoinPool-1-worker-57" > "ForkJoinPool-1-worker-57": > waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a > org.apache.spark.util.MutableURLClassLoader), > which is held by "ForkJoinPool-1-worker-7" > "ForkJoinPool-1-worker-7": > waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a > org.apache.hadoop.conf.Configuration), > which is held by "ForkJoinPool-1-worker-57" > Java stack information for the threads listed above: > === > "SparkUI-907": > at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328) > - waiting to lock <0x0005c0c1e5e0> (a > org.apache.hadoop.conf.Configuration) > at > org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145) > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363) > at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840) > at > org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74) > at java.net.URL.getURLStreamHandler(URL.java:1142) > at java.net.URL.(URL.java:599) > at java.net.URL.(URL.java:490) > at java.net.URL.(URL.java:439) > at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176) > at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:534) > at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > 
org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > "ForkJoinPool-1-worker-57": > at java.lang.ClassLoader.loadClass(ClassLoader.java:404) > - waiting to lock <0x0005b7991168>
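The cycle above is a classic lock-ordering deadlock: one thread holds the Configuration monitor while waiting for the classloader monitor, and another holds them in the opposite order. A small illustrative sketch (not Spark code; the lock names only stand in for the two monitors in the jstack output) of the standard remedy, acquiring both locks in one global order on every path:

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: confLock stands in for the
// org.apache.hadoop.conf.Configuration monitor and loaderLock for the
// MutableURLClassLoader monitor. Because every path takes confLock before
// loaderLock, the reported wait-for cycle ("A holds conf, wants loader"
// vs "B holds loader, wants conf") cannot form.
class ConsistentLockOrder {
    static final ReentrantLock confLock = new ReentrantLock();
    static final ReentrantLock loaderLock = new ReentrantLock();

    static int pathA(int x) {
        confLock.lock();
        try {
            loaderLock.lock();
            try {
                return x + 1; // work that needs both locks
            } finally {
                loaderLock.unlock();
            }
        } finally {
            confLock.unlock();
        }
    }

    static int pathB(int x) {
        // Same global order as pathA, even if this caller "thinks" in terms
        // of the loader first; never reverse the acquisition order.
        confLock.lock();
        try {
            loaderLock.lock();
            try {
                return x * 2;
            } finally {
                loaderLock.unlock();
            }
        } finally {
            confLock.unlock();
        }
    }
}
```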
[jira] [Assigned] (SPARK-26961) Found Java-level deadlock in Spark Driver
[ https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26961: Assignee: (was: Apache Spark)
[jira] [Resolved] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi resolved SPARK-27124. --- Resolution: Won't Do OK, based on the discussion this is a won't-do. At least users may get some clue from this thread. > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Affects Versions: 3.0.0 > Reporter: Gabor Somogyi > Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to convert schemas between Spark SQL and Avro. This is reachable from the Scala side but not from PySpark. I suggest adding it as a developer API to ease development for PySpark users.
[jira] [Created] (SPARK-27185) mapPartition to replace map to speedUp Dataset's toLocalIterator process
angerszhu created SPARK-27185: - Summary: mapPartition to replace map to speedUp Dataset's toLocalIterator process Key: SPARK-27185 URL: https://issues.apache.org/jira/browse/SPARK-27185 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.0.0 Reporter: angerszhu In my case, I use Dataset's toLocalIterator, and I found that the underlying code can be improved: changing map to mapPartitionsInternal would speed up decoding the data to InternalRow.
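The rationale is that per-record work done inside map (such as constructing a decoder or deserializer) can be hoisted to once-per-partition with a mapPartitions-style operator. A hedged, Spark-free sketch with hypothetical names; the counter exists only to make the difference observable:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration, not Spark's actual code path: with map-style
// processing, setup (building a decoder) runs once per row; with a
// per-partition variant it runs once per partition, amortizing the cost.
class MapVsMapPartitions {
    static int setups = 0;

    static Function<byte[], String> newDecoder() {
        setups++; // expensive setup we want to amortize
        return bytes -> new String(bytes);
    }

    // map-style: one decoder per element
    static List<String> decodePerElement(List<byte[]> partition) {
        List<String> out = new ArrayList<>();
        for (byte[] row : partition) {
            out.add(newDecoder().apply(row));
        }
        return out;
    }

    // mapPartitions-style: one decoder per partition
    static List<String> decodePerPartition(List<byte[]> partition) {
        Function<byte[], String> decoder = newDecoder();
        List<String> out = new ArrayList<>();
        for (byte[] row : partition) {
            out.add(decoder.apply(row));
        }
        return out;
    }
}
```

Both variants produce the same decoded rows; only the number of setup calls differs.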
[jira] [Commented] (SPARK-23375) Optimizer should remove unneeded Sort
[ https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794805#comment-16794805 ] Xiaoju Wu commented on SPARK-23375: --- But one of your test cases conflicts with what I talked about above:
test("sort should not be removed when there is a node which doesn't guarantee any order") {
  val orderedPlan = testRelation.select('a, 'b).orderBy('a.asc)
  val groupedAndResorted = orderedPlan.groupBy('a)(sum('a)).orderBy('a.asc)
  val optimized = Optimize.execute(groupedAndResorted.analyze)
  val correctAnswer = groupedAndResorted.analyze
  comparePlans(optimized, correctAnswer)
}
Why did you design it like this? In my opinion, since Aggregate won't pass up the ordering, the Sort below it is useless. > Optimizer should remove unneeded Sort > - > > Key: SPARK-23375 > URL: https://issues.apache.org/jira/browse/SPARK-23375 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: Marco Gaido > Assignee: Marco Gaido > Priority: Minor > Fix For: 2.4.0 > > > As pointed out in SPARK-23368, as of now there is no rule to remove the Sort operator from an already sorted plan, i.e. if we have a query like: > {code} > SELECT b > FROM ( > SELECT a, b > FROM table1 > ORDER BY a > ) t > ORDER BY a > {code} > The sort is actually executed twice, even though it is not needed.
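The point that an Aggregate does not propagate its input's ordering can be illustrated outside Spark: a hash-based aggregation yields the same groups whether or not the input was sorted, and its output order is unspecified, so a Sort below the Aggregate guarantees nothing to the operators above it. A minimal sketch (illustrative only, not the optimizer's code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged illustration of the ordering argument: a hash-based aggregate
// produces identical groups regardless of input order and does not carry
// the input order to its output (HashMap iteration order is unspecified).
// Hence a Sort *below* an Aggregate buys nothing, while a Sort *above* it
// may still be required.
class AggregateOrdering {
    // rows are {key, value} pairs; returns sum of values per key
    static Map<Integer, Integer> sumByKey(List<int[]> rows) {
        Map<Integer, Integer> sums = new HashMap<>();
        for (int[] r : rows) {
            sums.merge(r[0], r[1], Integer::sum);
        }
        return sums;
    }
}
```

Feeding the same rows sorted or shuffled produces equal aggregates, which is exactly why the lower Sort adds no guarantee.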
[jira] [Assigned] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object
[ https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27184: Assignee: (was: Apache Spark) > Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in > config object > > > Key: SPARK-27184 > URL: https://issues.apache.org/jira/browse/SPARK-27184 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: hehuiyuan > Priority: Minor > > In the org.apache.spark.internal.config object, we define the variables JARS and FILES; we can use them instead of the raw strings "spark.jars" and "spark.files". > private[spark] val JARS = ConfigBuilder("spark.jars") > .stringConf > .toSequence > .createWithDefault(Nil) > private[spark] val FILES = ConfigBuilder("spark.files") > .stringConf > .toSequence > .createWithDefault(Nil)
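The motivation is the usual one for centralizing configuration keys: define each key once and reference the constant everywhere, so a typo becomes a compile error rather than a silently ignored setting. A tiny sketch of the pattern (plain Java constants; Spark's real entries are the ConfigBuilder definitions quoted above):

```java
// Hypothetical sketch of the "define once, reference everywhere" pattern
// behind the ticket: callers use ConfigKeys.JARS instead of retyping the
// string "spark.jars" at every call site.
final class ConfigKeys {
    static final String JARS = "spark.jars";
    static final String FILES = "spark.files";

    private ConfigKeys() {} // no instances; constants only
}
```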
[jira] [Assigned] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object
[ https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27184: Assignee: Apache Spark
[jira] [Created] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object
hehuiyuan created SPARK-27184: - Summary: Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object Key: SPARK-27184 URL: https://issues.apache.org/jira/browse/SPARK-27184 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: hehuiyuan
[jira] [Updated] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object
[ https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hehuiyuan updated SPARK-27184: -- External issue URL: (was: https://aaa)