[jira] [Updated] (SPARK-48608) Spark 3.5: fails to build with value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
[ https://issues.apache.org/jira/browse/SPARK-48608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-48608:
----------------------------------
    Priority: Blocker  (was: Major)

> Spark 3.5: fails to build with value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
> ----------------------------------
>
> Key: SPARK-48608
> URL: https://issues.apache.org/jira/browse/SPARK-48608
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.2
> Reporter: Thomas Graves
> Priority: Blocker
>
> PR [https://github.com/apache/spark/pull/46594] seems to have broken the Spark 3.5 build.
> [ERROR] [Error] ...sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:299: value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
> I don't see that definition on the 3.5 branch:
> [https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala]
> I see it defined on master by https://issues.apache.org/jira/browse/SPARK-46905, which only went into 4.0.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48608) Spark 3.5: fails to build with value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
Thomas Graves created SPARK-48608:
-------------------------------------

    Summary: Spark 3.5: fails to build with value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
    Key: SPARK-48608
    URL: https://issues.apache.org/jira/browse/SPARK-48608
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 3.5.2
    Reporter: Thomas Graves

PR [https://github.com/apache/spark/pull/46594] seems to have broken the Spark 3.5 build.
[ERROR] [Error] ...sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:299: value defaultValueNotConstantError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
I don't see that definition on the 3.5 branch:
[https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala]
I see it defined on master by https://issues.apache.org/jira/browse/SPARK-46905, which only went into 4.0.
[jira] [Resolved] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-47398.
-----------------------------------
    Fix Version/s: 4.0.0
                   3.5.2
    Assignee: Raza Jafri
    Resolution: Fixed

> AQE doesn't allow for extension of InMemoryTableScanExec
> --------------------------------------------------------
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0, 3.5.1
> Reporter: Raza Jafri
> Assignee: Raza Jafri
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
> As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec.
> This change directly references `InMemoryTableScanExec`, which prevents users from extending the caching functionality that was added as part of SPARK-32274.
[jira] [Resolved] (SPARK-47458) Incorrect to calculate the concurrent task number
[ https://issues.apache.org/jira/browse/SPARK-47458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-47458.
-----------------------------------
    Fix Version/s: 4.0.0
    Assignee: Bobby Wang
    Resolution: Fixed

> Incorrect to calculate the concurrent task number
> -------------------------------------------------
>
> Key: SPARK-47458
> URL: https://issues.apache.org/jira/browse/SPARK-47458
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Bobby Wang
> Assignee: Bobby Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> The test case below failed:
>
> {code:java}
> test("problem of calculating the maximum concurrent task") {
>   withTempDir { dir =>
>     val discoveryScript = createTempScriptWithExpectedOutput(
>       dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", "2", "3"]}""")
>     val conf = new SparkConf()
>       // Set up a local cluster with a single executor that has 6 CPUs and 4 GPUs.
>       .setMaster("local-cluster[1, 6, 1024]")
>       .setAppName("test-cluster")
>       .set(WORKER_GPU_ID.amountConf, "4")
>       .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
>       .set(EXECUTOR_GPU_ID.amountConf, "4")
>       .set(TASK_GPU_ID.amountConf, "2")
>       // Disable barrier stage retry to fail the application as soon as possible.
>       .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
>     sc = new SparkContext(conf)
>     TestUtils.waitUntilExecutorsUp(sc, 1, 6)
>     // Set up a barrier stage with 2 tasks, each requiring 1 CPU and 2 GPUs.
>     // The total requirement (2 CPUs and 4 GPUs) fits within the cluster's
>     // 6 CPUs and 4 GPUs, so the stage is expected to run.
>     assert(sc.parallelize(Range(1, 10), 2)
>       .barrier()
>       .mapPartitions { iter => iter }
>       .collect() sameElements Range(1, 10).toArray[Int])
>   }
> }
> {code}
> The error log:
>
> [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
> org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
>   at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
>   at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
>   at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)
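For readers following the SPARK-47458 report, here is a minimal sketch of the slot arithmetic the test seems to rely on. It assumes that the number of concurrent tasks on an executor is bounded by its scarcest resource, and that each task uses 1 CPU (the default of `spark.task.cpus`). The object and method names are hypothetical and this is not Spark's scheduler code.

```scala
object BarrierSlotsSketch {
  /**
   * Hypothetical helper: how many tasks of one kind fit on a single executor,
   * assuming the scarcest resource is the limit.
   */
  def maxConcurrentTasks(
      executorCores: Int,
      cpusPerTask: Int,
      executorGpus: Int,
      gpusPerTask: Int): Int = {
    require(cpusPerTask > 0 && gpusPerTask > 0, "per-task amounts must be positive")
    math.min(executorCores / cpusPerTask, executorGpus / gpusPerTask)
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the test in the report: 6 cores, 4 GPUs, 2 GPUs per task.
    val slots = maxConcurrentTasks(executorCores = 6, cpusPerTask = 1, executorGpus = 4, gpusPerTask = 2)
    println(s"slots = $slots, barrier tasks requested = 2")
  }
}
```

Under these assumptions the executor offers as many slots as the barrier stage has tasks, which is why the slot check in the error log looks wrong for this configuration.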
[jira] [Assigned] (SPARK-47208) Allow overriding base overhead memory
[ https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-47208:
-------------------------------------
    Assignee: Joao Correia

> Allow overriding base overhead memory
> -------------------------------------
>
> Key: SPARK-47208
> URL: https://issues.apache.org/jira/browse/SPARK-47208
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes, Spark Core, YARN
> Affects Versions: 3.5.1
> Reporter: Joao Correia
> Assignee: Joao Correia
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> We can already select the desired overhead memory directly via the 'spark.driver/executor.memoryOverhead' flags. However, if that flag is not present, the overhead memory calculation goes as follows:
> {code:java}
> overhead_memory = Max(384, 'spark.driver/executor.memory' * 'spark.driver/executor.memoryOverheadFactor')
> where the 'memoryOverheadFactor' flag defaults to 0.1
> {code}
> There are times when being able to override the 384 MiB minimum directly can be beneficial. We may have a scenario where a lot of off-heap operations are performed (e.g. using package managers or native compression/decompression), where we don't need a large JVM heap but may still need a significant amount of memory in the Spark node.
> Using the 'memoryOverheadFactor' flag may not be appropriate, since we may not want the overhead allocation to scale directly with JVM memory, for cost-saving or resource-limitation reasons.
> As such, I propose the addition of a 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override the 384 MiB value used in the overhead calculation.
> The memory overhead calculation will now be:
> {code:java}
> min_memory = sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384)
> overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * 'spark.driver/executor.memoryOverheadFactor')
> {code}
> PR: https://github.com/apache/spark/pull/45240
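The proposed calculation in SPARK-47208 can be sketched in plain Scala as follows. This is an illustration of the pseudocode in the issue, not Spark's implementation: the object, method, and parameter names are hypothetical, all values are taken to be in MiB, and the explicit `memoryOverhead` setting is assumed to win whenever it is present.

```scala
object MemoryOverheadSketch {
  // Defaults quoted in the issue text: a 384 MiB floor and a factor of 0.1.
  val DefaultMinOverheadMiB: Long = 384L
  val DefaultOverheadFactor: Double = 0.1

  /**
   * Hypothetical helper mirroring the pseudocode in the issue.
   *
   * @param memoryMiB           the configured driver/executor memory
   * @param explicitOverheadMiB stands in for 'memoryOverhead', if set
   * @param minOverheadMiB      stands in for the proposed 'minMemoryOverhead', if set
   * @param overheadFactor      stands in for 'memoryOverheadFactor'
   */
  def overheadMiB(
      memoryMiB: Long,
      explicitOverheadMiB: Option[Long] = None,
      minOverheadMiB: Option[Long] = None,
      overheadFactor: Double = DefaultOverheadFactor): Long = {
    explicitOverheadMiB.getOrElse {
      val floor = minOverheadMiB.getOrElse(DefaultMinOverheadMiB)
      math.max(floor, (memoryMiB * overheadFactor).toLong)
    }
  }

  def main(args: Array[String]): Unit = {
    // A small heap with heavy off-heap use: raise the floor instead of the factor.
    println(overheadMiB(memoryMiB = 1024))
    println(overheadMiB(memoryMiB = 1024, minOverheadMiB = Some(2048L)))
  }
}
```

The second call shows the motivation in the issue: the overhead can be raised for a small JVM heap without making it scale with heap size.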
[jira] [Resolved] (SPARK-47208) Allow overriding base overhead memory
[ https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-47208.
-----------------------------------
    Fix Version/s: 4.0.0
    Resolution: Fixed

> Allow overriding base overhead memory
> -------------------------------------
>
> Key: SPARK-47208
> URL: https://issues.apache.org/jira/browse/SPARK-47208
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes, Spark Core, YARN
> Affects Versions: 3.5.1
> Reporter: Joao Correia
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> We can already select the desired overhead memory directly via the 'spark.driver/executor.memoryOverhead' flags. However, if that flag is not present, the overhead memory calculation goes as follows:
> {code:java}
> overhead_memory = Max(384, 'spark.driver/executor.memory' * 'spark.driver/executor.memoryOverheadFactor')
> where the 'memoryOverheadFactor' flag defaults to 0.1
> {code}
> There are times when being able to override the 384 MiB minimum directly can be beneficial. We may have a scenario where a lot of off-heap operations are performed (e.g. using package managers or native compression/decompression), where we don't need a large JVM heap but may still need a significant amount of memory in the Spark node.
> Using the 'memoryOverheadFactor' flag may not be appropriate, since we may not want the overhead allocation to scale directly with JVM memory, for cost-saving or resource-limitation reasons.
> As such, I propose the addition of a 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override the 384 MiB value used in the overhead calculation.
> The memory overhead calculation will now be:
> {code:java}
> min_memory = sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384)
> overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * 'spark.driver/executor.memoryOverheadFactor')
> {code}
> PR: https://github.com/apache/spark/pull/45240
[jira] [Commented] (SPARK-45527) Task fraction resource request is not expected
[ https://issues.apache.org/jira/browse/SPARK-45527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821279#comment-17821279 ]

Thomas Graves commented on SPARK-45527:
---------------------------------------

Note that this is related to SPARK-39853, which was supposed to implement stage-level scheduling with dynamic allocation disabled. That PR did not properly handle resources (GPU, FPGA, etc.).

> Task fraction resource request is not expected
> ----------------------------------------------
>
> Key: SPARK-45527
> URL: https://issues.apache.org/jira/browse/SPARK-45527
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
> Reporter: wuyi
> Assignee: Bobby Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {code:java}
> test("SPARK-XXX") {
>   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
>   withTempDir { dir =>
>     val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
>       """{"name": "gpu","addresses":["0"]}""")
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local-cluster[1, 12, 1024]")
>       .set("spark.executor.cores", "12")
>     conf.set(TASK_GPU_ID.amountConf, "0.08")
>     conf.set(WORKER_GPU_ID.amountConf, "1")
>     conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
>     conf.set(EXECUTOR_GPU_ID.amountConf, "1")
>     sc = new SparkContext(conf)
>     val rdd = sc.range(0, 100, 1, 4)
>     var rdd1 = rdd.repartition(3)
>     val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
>     val rp = new ResourceProfileBuilder().require(treqs).build
>     rdd1 = rdd1.withResources(rp)
>     assert(rdd1.collect().size === 100)
>   }
> }
> {code}
> In the above test, the 3 tasks generated by rdd1 are expected to be executed in sequence, as we expect "new TaskResourceRequests().cpus(1).resource("gpu", 1.0)" to override "conf.set(TASK_GPU_ID.amountConf, "0.08")". However, those 3 tasks actually run in parallel.
> The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. In this case, "gpu.numParts" is initialized to 12 (1/0.08) and won't change even if there's a new task resource request (e.g., resource("gpu", 1.0) in this case). Thus, those 3 tasks are able to be executed in parallel.
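The arithmetic behind the root cause in SPARK-45527 can be illustrated with a small sketch. It assumes that a fractional task amount of at most 1.0 yields floor(1 / amount) task slots per GPU address, which matches the "12 (1/0.08)" figure in the report, and it ignores the CPU limit (12 cores at 1 CPU per task is not the bottleneck here). The names below are hypothetical and are not Spark's internal API.

```scala
object GpuSlotsSketch {
  /** Hypothetical helper: how many tasks can share one GPU address. */
  def slotsPerAddress(taskAmount: Double): Int = {
    require(taskAmount > 0.0 && taskAmount <= 1.0, "sketch covers amounts in (0, 1] only")
    math.floor(1.0 / taskAmount).toInt
  }

  /** Hypothetical helper: tasks that can run at once, limited by GPU slots only. */
  def maxConcurrent(numTasks: Int, numAddresses: Int, taskAmount: Double): Int =
    math.min(numTasks, numAddresses * slotsPerAddress(taskAmount))

  def main(args: Array[String]): Unit = {
    // Slot count frozen at the default task amount of 0.08: all 3 tasks fit at once.
    println(maxConcurrent(numTasks = 3, numAddresses = 1, taskAmount = 0.08))
    // Slot count recomputed for the profile's amount of 1.0: one task at a time.
    println(maxConcurrent(numTasks = 3, numAddresses = 1, taskAmount = 1.0))
  }
}
```

The first call corresponds to the behavior reported in the issue and the second to the behavior the reporter expected.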
[jira] [Resolved] (SPARK-45527) Task fraction resource request is not expected
[ https://issues.apache.org/jira/browse/SPARK-45527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-45527.
-----------------------------------
    Resolution: Fixed

> Task fraction resource request is not expected
> ----------------------------------------------
>
> Key: SPARK-45527
> URL: https://issues.apache.org/jira/browse/SPARK-45527
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
> Reporter: wuyi
> Assignee: Bobby Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {code:java}
> test("SPARK-XXX") {
>   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
>   withTempDir { dir =>
>     val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
>       """{"name": "gpu","addresses":["0"]}""")
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local-cluster[1, 12, 1024]")
>       .set("spark.executor.cores", "12")
>     conf.set(TASK_GPU_ID.amountConf, "0.08")
>     conf.set(WORKER_GPU_ID.amountConf, "1")
>     conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
>     conf.set(EXECUTOR_GPU_ID.amountConf, "1")
>     sc = new SparkContext(conf)
>     val rdd = sc.range(0, 100, 1, 4)
>     var rdd1 = rdd.repartition(3)
>     val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
>     val rp = new ResourceProfileBuilder().require(treqs).build
>     rdd1 = rdd1.withResources(rp)
>     assert(rdd1.collect().size === 100)
>   }
> }
> {code}
> In the above test, the 3 tasks generated by rdd1 are expected to be executed in sequence, as we expect "new TaskResourceRequests().cpus(1).resource("gpu", 1.0)" to override "conf.set(TASK_GPU_ID.amountConf, "0.08")". However, those 3 tasks actually run in parallel.
> The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. In this case, "gpu.numParts" is initialized to 12 (1/0.08) and won't change even if there's a new task resource request (e.g., resource("gpu", 1.0) in this case). Thus, those 3 tasks are able to be executed in parallel.
[jira] [Updated] (SPARK-45527) Task fraction resource request is not expected
[ https://issues.apache.org/jira/browse/SPARK-45527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-45527:
----------------------------------
    Fix Version/s: 4.0.0

> Task fraction resource request is not expected
> ----------------------------------------------
>
> Key: SPARK-45527
> URL: https://issues.apache.org/jira/browse/SPARK-45527
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
> Reporter: wuyi
> Assignee: Bobby Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {code:java}
> test("SPARK-XXX") {
>   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
>   withTempDir { dir =>
>     val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
>       """{"name": "gpu","addresses":["0"]}""")
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local-cluster[1, 12, 1024]")
>       .set("spark.executor.cores", "12")
>     conf.set(TASK_GPU_ID.amountConf, "0.08")
>     conf.set(WORKER_GPU_ID.amountConf, "1")
>     conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
>     conf.set(EXECUTOR_GPU_ID.amountConf, "1")
>     sc = new SparkContext(conf)
>     val rdd = sc.range(0, 100, 1, 4)
>     var rdd1 = rdd.repartition(3)
>     val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
>     val rp = new ResourceProfileBuilder().require(treqs).build
>     rdd1 = rdd1.withResources(rp)
>     assert(rdd1.collect().size === 100)
>   }
> }
> {code}
> In the above test, the 3 tasks generated by rdd1 are expected to be executed in sequence, as we expect "new TaskResourceRequests().cpus(1).resource("gpu", 1.0)" to override "conf.set(TASK_GPU_ID.amountConf, "0.08")". However, those 3 tasks actually run in parallel.
> The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. In this case, "gpu.numParts" is initialized to 12 (1/0.08) and won't change even if there's a new task resource request (e.g., resource("gpu", 1.0) in this case). Thus, those 3 tasks are able to be executed in parallel.
[jira] [Assigned] (SPARK-45527) Task fraction resource request is not expected
[ https://issues.apache.org/jira/browse/SPARK-45527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-45527:
-------------------------------------
    Assignee: Bobby Wang

> Task fraction resource request is not expected
> ----------------------------------------------
>
> Key: SPARK-45527
> URL: https://issues.apache.org/jira/browse/SPARK-45527
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
> Reporter: wuyi
> Assignee: Bobby Wang
> Priority: Major
> Labels: pull-request-available
>
> {code:java}
> test("SPARK-XXX") {
>   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
>   withTempDir { dir =>
>     val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
>       """{"name": "gpu","addresses":["0"]}""")
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local-cluster[1, 12, 1024]")
>       .set("spark.executor.cores", "12")
>     conf.set(TASK_GPU_ID.amountConf, "0.08")
>     conf.set(WORKER_GPU_ID.amountConf, "1")
>     conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
>     conf.set(EXECUTOR_GPU_ID.amountConf, "1")
>     sc = new SparkContext(conf)
>     val rdd = sc.range(0, 100, 1, 4)
>     var rdd1 = rdd.repartition(3)
>     val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
>     val rp = new ResourceProfileBuilder().require(treqs).build
>     rdd1 = rdd1.withResources(rp)
>     assert(rdd1.collect().size === 100)
>   }
> }
> {code}
> In the above test, the 3 tasks generated by rdd1 are expected to be executed in sequence, as we expect "new TaskResourceRequests().cpus(1).resource("gpu", 1.0)" to override "conf.set(TASK_GPU_ID.amountConf, "0.08")". However, those 3 tasks actually run in parallel.
> The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. In this case, "gpu.numParts" is initialized to 12 (1/0.08) and won't change even if there's a new task resource request (e.g., resource("gpu", 1.0) in this case). Thus, those 3 tasks are able to be executed in parallel.
[jira] [Commented] (SPARK-40129) Decimal multiply can produce the wrong answer because it rounds twice
[ https://issues.apache.org/jira/browse/SPARK-40129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790107#comment-17790107 ]

Thomas Graves commented on SPARK-40129:
---------------------------------------

This looks like a dup of https://issues.apache.org/jira/browse/SPARK-45786

> Decimal multiply can produce the wrong answer because it rounds twice
> ---------------------------------------------------------------------
>
> Key: SPARK-40129
> URL: https://issues.apache.org/jira/browse/SPARK-40129
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0, 3.3.0, 3.4.0
> Reporter: Robert Joseph Evans
> Priority: Major
> Labels: pull-request-available
>
> This looks like it has been around for a long time, but I have reproduced it in 3.2.0+.
> The example here is multiplying Decimal(38, 10) by another Decimal(38, 10), but I think it can be reproduced with other number combinations, and possibly with divide too.
> {code:java}
> Seq("9173594185998001607642838421.5479932913").toDF.selectExpr("CAST(value as DECIMAL(38,10)) as a").selectExpr("a * CAST(-12 as DECIMAL(38,10))").show(truncate=false)
> {code}
> This produces an answer in Spark of {{-110083130231976019291714061058.575920}}. But if I do the calculation in regular Java BigDecimal, I get {{-110083130231976019291714061058.575919}}:
> {code:java}
> BigDecimal l = new BigDecimal("9173594185998001607642838421.5479932913");
> BigDecimal r = new BigDecimal("-12.00");
> BigDecimal prod = l.multiply(r);
> BigDecimal rounded_prod = prod.setScale(6, RoundingMode.HALF_UP);
> {code}
> Spark does essentially all of the same operations, but it uses Decimal to do it instead of Java's BigDecimal directly. Spark, by way of Decimal, will set a MathContext for the multiply operation that has a max precision of 38 and will do half-up rounding. That means that the result of the multiply operation in Spark is {{-110083130231976019291714061058.57591950}}, but for the Java BigDecimal code the result is {{-110083130231976019291714061058.575919495600}}. Then in CheckOverflow for 3.2.0 and 3.3.0, or in just the regular Multiply expression in 3.4.0, setScale is called (as a part of Decimal.setPrecision). At that point the already rounded number is rounded yet again, resulting in what is arguably a wrong answer by Spark.
> I have not fully tested this, but it looks like we could just remove the MathContext entirely in Decimal, or set it to UNLIMITED. All of the decimal operations appear to have their own overflow and rounding anyway. If we want to potentially reduce the total memory usage, we could also set the max precision to 39 and truncate (round down) the result in the math context instead. That would then let us round the result correctly in setPrecision afterwards.
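The double rounding described in SPARK-40129 can be reproduced outside Spark with `java.math.BigDecimal` alone. The sketch below uses the operands from the report and a `MathContext` with precision 38 and half-up rounding, as the report describes for Spark's Decimal. The object and value names are hypothetical, and the code only mimics the two rounding steps; it does not call Spark's Decimal.

```scala
import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

object DoubleRoundingSketch {
  // Operands taken from the example in the report.
  private val l = new JBigDecimal("9173594185998001607642838421.5479932913")
  private val r = new JBigDecimal("-12.00")

  // Exact product, rounded a single time to scale 6.
  val roundedOnce: JBigDecimal = l.multiply(r).setScale(6, RoundingMode.HALF_UP)

  // Product rounded to 38 significant digits first, then rounded again to scale 6.
  private val mc = new MathContext(38, RoundingMode.HALF_UP)
  val intermediate: JBigDecimal = l.multiply(r, mc)
  val roundedTwice: JBigDecimal = intermediate.setScale(6, RoundingMode.HALF_UP)

  def main(args: Array[String]): Unit = {
    println(s"rounded once:  $roundedOnce")
    println(s"intermediate:  $intermediate")
    println(s"rounded twice: $roundedTwice")
  }
}
```

The first rounding turns the trailing digits into an exact half at scale 6, so the second half-up rounding moves the last digit one step further from zero than a single rounding of the exact product would.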
[jira] [Resolved] (SPARK-45937) Fix documentation of spark.executor.maxNumFailures
[ https://issues.apache.org/jira/browse/SPARK-45937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-45937.
-----------------------------------
    Resolution: Duplicate

> Fix documentation of spark.executor.maxNumFailures
> --------------------------------------------------
>
> Key: SPARK-45937
> URL: https://issues.apache.org/jira/browse/SPARK-45937
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Thomas Graves
> Priority: Critical
>
> https://issues.apache.org/jira/browse/SPARK-41210 added support for spark.executor.maxNumFailures on Kubernetes; it made this config generic and deprecated the YARN version. Neither the config nor its defaults are documented.
> [https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb]
> It also added spark.executor.failuresValidityInterval.
> Both need to have default values specified for YARN and K8s, and the YARN documentation for the equivalent configs (spark.yarn.max.executor.failures) needs to be removed.
[jira] [Commented] (SPARK-45937) Fix documentation of spark.executor.maxNumFailures
[ https://issues.apache.org/jira/browse/SPARK-45937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786395#comment-17786395 ]

Thomas Graves commented on SPARK-45937:
---------------------------------------

@Cheng Pan Could you fix this as a follow-up?

> Fix documentation of spark.executor.maxNumFailures
> --------------------------------------------------
>
> Key: SPARK-45937
> URL: https://issues.apache.org/jira/browse/SPARK-45937
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Thomas Graves
> Priority: Critical
>
> https://issues.apache.org/jira/browse/SPARK-41210 added support for spark.executor.maxNumFailures on Kubernetes; it made this config generic and deprecated the YARN version. Neither the config nor its defaults are documented.
> [https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb]
> It also added spark.executor.failuresValidityInterval.
> Both need to have default values specified for YARN and K8s, and the YARN documentation for the equivalent configs (spark.yarn.max.executor.failures) needs to be removed.
[jira] [Created] (SPARK-45937) Fix documentation of spark.executor.maxNumFailures
Thomas Graves created SPARK-45937:
-------------------------------------

    Summary: Fix documentation of spark.executor.maxNumFailures
    Key: SPARK-45937
    URL: https://issues.apache.org/jira/browse/SPARK-45937
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 3.5.0
    Reporter: Thomas Graves

https://issues.apache.org/jira/browse/SPARK-41210 added support for spark.executor.maxNumFailures on Kubernetes; it made this config generic and deprecated the YARN version. Neither the config nor its defaults are documented.
[https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb]
It also added spark.executor.failuresValidityInterval.
Both need to have default values specified for YARN and K8s, and the YARN documentation for the equivalent configs (spark.yarn.max.executor.failures) needs to be removed.
[jira] [Resolved] (SPARK-45495) Support stage level task resource profile for k8s cluster when dynamic allocation disabled
[ https://issues.apache.org/jira/browse/SPARK-45495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-45495.
-----------------------------------
    Resolution: Fixed

> Support stage level task resource profile for k8s cluster when dynamic allocation disabled
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-45495
> URL: https://issues.apache.org/jira/browse/SPARK-45495
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: Bobby Wang
> Assignee: Bobby Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
> [https://github.com/apache/spark/pull/37268] introduced a new feature that supports stage-level task resource profiles for standalone clusters when dynamic allocation is disabled. It's a really cool feature, especially for ML/DL cases; more details can be found in that PR.
>
> The problem here is that the feature is only available for standalone and YARN clusters for now, but most users would also expect to be able to use it on other Spark clusters such as K8s.
>
> So I filed this issue to track this task.
[jira] [Commented] (SPARK-45527) Task fraction resource request is not expected
[ https://issues.apache.org/jira/browse/SPARK-45527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774957#comment-17774957 ]

Thomas Graves commented on SPARK-45527:
---------------------------------------

Thanks for filing and digging into this. I assume this happens only with TaskResourceRequests when using the default ExecutorResourceRequests. It seems this has been a bug since that functionality was added. Either way, when we fix it we should add similar tests if we can.

> Task fraction resource request is not expected
> ----------------------------------------------
>
> Key: SPARK-45527
> URL: https://issues.apache.org/jira/browse/SPARK-45527
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
> Reporter: wuyi
> Priority: Major
>
> {code:java}
> test("SPARK-XXX") {
>   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
>   withTempDir { dir =>
>     val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
>       """{"name": "gpu","addresses":["0"]}""")
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local-cluster[1, 12, 1024]")
>       .set("spark.executor.cores", "12")
>     conf.set(TASK_GPU_ID.amountConf, "0.08")
>     conf.set(WORKER_GPU_ID.amountConf, "1")
>     conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
>     conf.set(EXECUTOR_GPU_ID.amountConf, "1")
>     sc = new SparkContext(conf)
>     val rdd = sc.range(0, 100, 1, 4)
>     var rdd1 = rdd.repartition(3)
>     val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
>     val rp = new ResourceProfileBuilder().require(treqs).build
>     rdd1 = rdd1.withResources(rp)
>     assert(rdd1.collect().size === 100)
>   }
> }
> {code}
> In the above test, the 3 tasks generated by rdd1 are expected to be executed in sequence, as we expect "new TaskResourceRequests().cpus(1).resource("gpu", 1.0)" to override "conf.set(TASK_GPU_ID.amountConf, "0.08")". However, those 3 tasks actually run in parallel.
> The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. In this case, "gpu.numParts" is initialized to 12 (1/0.08) and won't change even if there's a new task resource request (e.g., resource("gpu", 1.0) in this case). Thus, those 3 tasks are able to be executed in parallel.
[jira] [Updated] (SPARK-45250) Support stage level task resource profile for yarn cluster when dynamic allocation disabled
[ https://issues.apache.org/jira/browse/SPARK-45250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-45250: -- Fix Version/s: 3.5.1 > Support stage level task resource profile for yarn cluster when dynamic > allocation disabled > --- > > Key: SPARK-45250 > URL: https://issues.apache.org/jira/browse/SPARK-45250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Bobby Wang >Assignee: Bobby Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > > > [https://github.com/apache/spark/pull/37268] has introduced a new feature > that supports stage-level schedule task resource profile for standalone > cluster when dynamic allocation is disabled. It's really cool feature, > especially for ML/DL cases, more details can be found in that PR. > > The problem here is that the feature is only available for standalone cluster > for now, but most users would also expect it can be used for other spark > clusters like yarn and k8s. > > So I file this issue to track this task. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-44940: -- Fix Version/s: 3.5.0 (was: 3.5.1) > Improve performance of JSON parsing when > "spark.sql.json.enablePartialResults" is enabled > - > > Key: SPARK-44940 > URL: https://issues.apache.org/jira/browse/SPARK-44940 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Labels: correctness, pull-request-available > Fix For: 3.4.2, 3.5.0 > > > Follow-up on https://issues.apache.org/jira/browse/SPARK-40646. > I found that JSON parsing is significantly slower due to exception creation > in control flow. Also, some fields are not parsed correctly and the exception > is thrown in certain cases: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct(rows.scala:51) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct$(rows.scala:51) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getStruct(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:590) > ... 39 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769338#comment-17769338 ]

Thomas Graves commented on SPARK-44940:
---------------------------------------

I noticed this went into 3.5.0 ([https://github.com/apache/spark/commits/v3.5.0]), so I am updating the fix versions.

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness, pull-request-available
>             Fix For: 3.4.2, 3.5.1
>
> Follow-up on https://issues.apache.org/jira/browse/SPARK-40646.
> I found that JSON parsing is significantly slower due to exception creation
> in control flow. Also, some fields are not parsed correctly and the exception
> is thrown in certain cases:
> {code:java}
> Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct(rows.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct$(rows.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getStruct(rows.scala:195)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:590)
>   ... 39 more {code}
[jira] [Commented] (SPARK-43919) Extract JSON functionality out of Row
[ https://issues.apache.org/jira/browse/SPARK-43919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766400#comment-17766400 ] Thomas Graves commented on SPARK-43919: --- This is missing description, comments, and link to the pr, I don't understand how this can be resolved without any of those. Doing some searching seems: [https://github.com/apache/spark/pull/41425] [~hvanhovell] please make sure proper linkage before resolving. > Extract JSON functionality out of Row > -- > > Key: SPARK-43919 > URL: https://issues.apache.org/jira/browse/SPARK-43919 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44284) Introduce simpe conf system for sql/api
[ https://issues.apache.org/jira/browse/SPARK-44284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762102#comment-17762102 ] Thomas Graves commented on SPARK-44284: --- Can we get a description on this? This seems like a fairly significant change for a one line without description here or in the pr. > Introduce simpe conf system for sql/api > --- > > Key: SPARK-44284 > URL: https://issues.apache.org/jira/browse/SPARK-44284 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > > Create a simple conf system for classes in sql/api -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44144) Enable `spark.authenticate` by default in K8s environment
[ https://issues.apache.org/jira/browse/SPARK-44144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759061#comment-17759061 ]

Thomas Graves commented on SPARK-44144:
---------------------------------------

I'm not necessarily against this, but it seems odd to do it for one resource manager and not others, especially YARN, where the same is true: it automatically generates secrets. It's also inconsistent with what we have done in the past, which is essentially auth off by default. Does this affect performance, for instance?

> Enable `spark.authenticate` by default in K8s environment
> ---------------------------------------------------------
>
>                 Key: SPARK-44144
>                 URL: https://issues.apache.org/jira/browse/SPARK-44144
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: 4.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> Apache Spark has supported spark.authenticate and spark.authenticate.secret
> since 1.0.0.
> This issue proposes simply setting `spark.authenticate=true` in K8s
> environments. No other change is required because Spark will automatically
> generate an authentication secret unique to each application, which slightly
> improves per-application isolation and security in K8s environments.
[jira] [Commented] (SPARK-44144) Enable `spark.authenticate` by default in K8s environment
[ https://issues.apache.org/jira/browse/SPARK-44144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758602#comment-17758602 ] Thomas Graves commented on SPARK-44144: --- Can you add a description on this? Why do we want this on by default? > Enable `spark.authenticate` by default in K8s environment > - > > Key: SPARK-44144 > URL: https://issues.apache.org/jira/browse/SPARK-44144 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756077#comment-17756077 ] Thomas Graves commented on SPARK-44871: --- Can you add a description to this please > Fix PERCENTILE_DISC behaviour > - > > Key: SPARK-44871 > URL: https://issues.apache.org/jira/browse/SPARK-44871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0 >Reporter: Peter Toth >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-44134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-44134: -- Fix Version/s: 3.4.2 (was: 3.4.1) > Can't set resources (GPU/FPGA) to 0 when they are set to positive value in > spark-defaults.conf > -- > > Key: SPARK-44134 > URL: https://issues.apache.org/jira/browse/SPARK-44134 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.3.3, 3.5.0, 3.4.2 > > > With resource aware scheduling, if you specify a default value in the > spark-defaults.conf, a user can't override that to set it to 0. > Meaning spark-defaults.conf has something like: > {{spark.executor.resource.\{resourceName}.amount=1}} > {{spark.task.resource.\{resourceName}.amount}} =1 > If the user tries to override when submitting an application with > {{{}spark.executor.resource.\{resourceName}.amount{}}}=0 and > {{spark.task.resource.\{resourceName}.amount}} =0, it gives the user an error: > > {code:java} > 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session. 
> org.apache.spark.SparkException: No executor resource configs were not > specified for the following task configs: gpu > at > org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206) > at > org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138) > at > org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95) > at > org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49) > at org.apache.spark.SparkContext.(SparkContext.scala:455) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code} > This used to work, my guess is this may have gotten broken with the stage > level scheduling feature. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
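The override semantics the reporter expects in SPARK-44134 can be sketched as follows. This is a minimal model in plain Python, not Spark's ResourceProfile code and not the fix that was eventually merged; `effective_resources` and `check_task_resources` are hypothetical helpers, and treating an amount of 0 as "resource not requested" is the assumption being illustrated.

```python
def effective_resources(conf: dict, prefix: str) -> dict:
    """Collect resource amounts under a prefix, dropping resources set to 0."""
    suffix = ".amount"
    result = {}
    for key, value in conf.items():
        if key.startswith(prefix) and key.endswith(suffix):
            name = key[len(prefix):-len(suffix)]
            amount = float(value)
            if amount > 0:  # assumption: 0 means "do not request this resource"
                result[name] = amount
    return result


def check_task_resources(defaults: dict, overrides: dict):
    """Merge submit-time overrides over the defaults, then validate."""
    merged = {**defaults, **overrides}  # later (submit-time) values win
    executor = effective_resources(merged, "spark.executor.resource.")
    task = effective_resources(merged, "spark.task.resource.")
    missing = sorted(set(task) - set(executor))
    if missing:
        raise ValueError("task resources without executor resources: " + ", ".join(missing))
    return executor, task


defaults = {
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "1",
}
overrides = {
    "spark.executor.resource.gpu.amount": "0",
    "spark.task.resource.gpu.amount": "0",
}
print(check_task_resources(defaults, overrides))
```

In this model, overriding both amounts to 0 yields no GPU request at all and passes validation, while zeroing only the executor amount still raises an error, since a task would then ask for a resource the executor does not have.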
[jira] [Updated] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-44134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-44134: -- Description: With resource aware scheduling, if you specify a default value in the spark-defaults.conf, a user can't override that to set it to 0. Meaning spark-defaults.conf has something like: {{spark.executor.resource.\{resourceName}.amount=1}} {{spark.task.resource.\{resourceName}.amount}} =1 If the user tries to override when submitting an application with {{{}spark.executor.resource.\{resourceName}.amount{}}}=0 and {{spark.task.resource.\{resourceName}.amount}} =0, it gives the user an error: {code:java} 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session. org.apache.spark.SparkException: No executor resource configs were not specified for the following task configs: gpu at org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206) at org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138) at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95) at org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49) at org.apache.spark.SparkContext.(SparkContext.scala:455) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code} This used to work, my guess is this may have gotten broken with the stage level scheduling feature. was: With resource aware scheduling, if you specify a default value in the spark-defaults.conf, a user can't override that to set it to 0. 
Meaning spark-defaults.conf has something like: {{spark.executor.resource.\{resourceName}.amount=1}} {{spark.task.resource.\{resourceName}.amount}} =1 {{}} If the user tries to override when submitting an application with {{{}spark.executor.resource.\{resourceName}.amount{}}}=0 and {{{}{}}}{{{}spark.task.resource.\{resourceName}.amount{}}}{{ =0, it gives the user an error:}} {{}} {code:java} 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session. org.apache.spark.SparkException: No executor resource configs were not specified for the following task configs: gpu at org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206) at org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138) at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95) at org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49) at org.apache.spark.SparkContext.(SparkContext.scala:455) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code} This used to work, my guess is this may have gotten broken with the stage level scheduling feature. > Can't set resources (GPU/FPGA) to 0 when they are set to positive value in > spark-defaults.conf > -- > > Key: SPARK-44134 > URL: https://issues.apache.org/jira/browse/SPARK-44134 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Thomas Graves >Priority: Major > > With resource aware scheduling, if you specify a default value in the > spark-defaults.conf, a user can't override that to set it to 0. 
> Meaning spark-defaults.conf has something like: > {{spark.executor.resource.\{resourceName}.amount=1}} > {{spark.task.resource.\{resourceName}.amount}} =1 > If the user tries to override when submitting an application with > {{{}spark.executor.resource.\{resourceName}.amount{}}}=0 and > {{spark.task.resource.\{resourceName}.amount}} =0, it gives the user an error: > > {code:java} > 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session. > org.apache.spark.SparkException: No executor resource configs were not > specified for the following task configs: gpu > at > org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206) > at > org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139) > at scala.Option.getOrElse(Opt
[jira] [Created] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf
Thomas Graves created SPARK-44134: - Summary: Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf Key: SPARK-44134 URL: https://issues.apache.org/jira/browse/SPARK-44134 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Thomas Graves With resource aware scheduling, if you specify a default value in the spark-defaults.conf, a user can't override that to set it to 0. Meaning spark-defaults.conf has something like: {{spark.executor.resource.\{resourceName}.amount=1}} {{spark.task.resource.\{resourceName}.amount}} =1 {{}} If the user tries to override when submitting an application with {{{}spark.executor.resource.\{resourceName}.amount{}}}=0 and {{{}{}}}{{{}spark.task.resource.\{resourceName}.amount{}}}{{ =0, it gives the user an error:}} {{}} {code:java} 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session. org.apache.spark.SparkException: No executor resource configs were not specified for the following task configs: gpu at org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206) at org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138) at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95) at org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49) at org.apache.spark.SparkContext.(SparkContext.scala:455) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code} This used to work, my guess is this may have gotten broken with the stage level scheduling feature. 
[jira] [Commented] (SPARK-44134) Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-44134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735746#comment-17735746 ] Thomas Graves commented on SPARK-44134: --- I'm working on a fix for this > Can't set resources (GPU/FPGA) to 0 when they are set to positive value in > spark-defaults.conf > -- > > Key: SPARK-44134 > URL: https://issues.apache.org/jira/browse/SPARK-44134 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Thomas Graves >Priority: Major > > With resource aware scheduling, if you specify a default value in the > spark-defaults.conf, a user can't override that to set it to 0. > Meaning spark-defaults.conf has something like: > {{spark.executor.resource.\{resourceName}.amount=1}} > {{spark.task.resource.\{resourceName}.amount}} =1 > {{}} > If the user tries to override when submitting an application with > {{{}spark.executor.resource.\{resourceName}.amount{}}}=0 and > {{{}{}}}{{{}spark.task.resource.\{resourceName}.amount{}}}{{ =0, it gives the > user an error:}} > {{}} > {code:java} > 23/06/21 09:12:57 ERROR Main: Failed to initialize Spark session. 
> org.apache.spark.SparkException: No executor resource configs were not > specified for the following task configs: gpu > at > org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:206) > at > org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:139) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:138) > at > org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:95) > at > org.apache.spark.resource.ResourceProfileManager.(ResourceProfileManager.scala:49) > at org.apache.spark.SparkContext.(SparkContext.scala:455) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953){code} > This used to work, my guess is this may have gotten broken with the stage > level scheduling feature. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43510) Spark application hangs when YarnAllocator adds running executors after processing completed containers
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-43510. --- Fix Version/s: 3.4.1 3.5.0 Assignee: Manu Zhang Resolution: Fixed > Spark application hangs when YarnAllocator adds running executors after > processing completed containers > --- > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > I see application hangs when containers are preempted immediately after > allocation as follows. > {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find executorId for the > container when processing completed containers. The only plausible cause is > YarnAllocator added the running executor after processing completed > containers. The former happens in a separate thread after executor launch. > YarnAllocator believes there are still running executors, although they are > already lost due to preemption. Hence, the application hangs without any > running executors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
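The ordering problem described in SPARK-43510 can be modelled in a few lines. This is a toy simulation in plain Python, not YarnAllocator itself: the class `AllocatorModel`, its method names, and the guard shown here are hypothetical, and the guard is only one possible way to handle a late registration, not necessarily what the merged fix does.

```python
class AllocatorModel:
    """Toy model of an allocator that tracks running executors per container."""

    def __init__(self, guard_late_registration: bool):
        self.guard = guard_late_registration
        self.running = {}               # container id -> executor id
        self.completed_unknown = set()  # completed before being registered

    def process_completed(self, container_id: str) -> None:
        if container_id in self.running:
            del self.running[container_id]
        else:
            # Mirrors the "Cannot find executorId for container" warning.
            self.completed_unknown.add(container_id)

    def register_running(self, container_id: str, executor_id: str) -> None:
        # In the report, this step happens on a separate thread after launch.
        if self.guard and container_id in self.completed_unknown:
            self.completed_unknown.discard(container_id)
            return  # container is already gone, do not count it as running
        self.running[container_id] = executor_id


def running_after_race(guard: bool) -> int:
    model = AllocatorModel(guard)
    # Racy order from the log: completion is processed before registration.
    model.process_completed("container_000382")
    model.register_running("container_000382", "277")
    return len(model.running)


print(running_after_race(guard=False), running_after_race(guard=True))
```

Without the guard the model still counts one running executor after the container was preempted, which matches the hang described in the issue: the allocator believes it has executors and does not request replacements.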
[jira] [Commented] (SPARK-41660) only propagate metadata columns if they are used
[ https://issues.apache.org/jira/browse/SPARK-41660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726740#comment-17726740 ]

Thomas Graves commented on SPARK-41660:
---------------------------------------

It looks like this was backported to 3.3 with https://github.com/apache/spark/pull/40889

> only propagate metadata columns if they are used
> ------------------------------------------------
>
>                 Key: SPARK-41660
>                 URL: https://issues.apache.org/jira/browse/SPARK-41660
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.3.3, 3.4.0
>
[jira] [Updated] (SPARK-41660) only propagate metadata columns if they are used
[ https://issues.apache.org/jira/browse/SPARK-41660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-41660: -- Fix Version/s: 3.3.3 > only propagate metadata columns if they are used > > > Key: SPARK-41660 > URL: https://issues.apache.org/jira/browse/SPARK-41660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.3, 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43340) JsonProtocol is not backward compatible
[ https://issues.apache.org/jira/browse/SPARK-43340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718601#comment-17718601 ] Thomas Graves commented on SPARK-43340: --- Likely related to SPARK-39489 > JsonProtocol is not backward compatible > --- > > Key: SPARK-43340 > URL: https://issues.apache.org/jira/browse/SPARK-43340 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0, 3.5.0 >Reporter: Ahmed Hussein >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > Recently I was testing with some 3.0.2 eventlogs. > The SHS-3.4+ does not interpret failed jobs/ failed SQLs correctly. > Instead it will list them as "Incomplete/Active" whereas it should be listed > as "Failed". > The problem is due to missing fields in eventlogs generated by previous > versions. In this case the eventlog does not have "Stack Trace" field which > causes a NPE > > > > {code:java} > {"Event":"SparkListenerJobEnd","Job ID":31,"Completion > Time":1616171909785,"Job Result":{"Result":"JobFailed","Exception": > {"Message":"Job aborted"} > }} > {code} > > > The SHS logfile > > > {code:java} > 23/05/01 21:57:16 INFO FsHistoryProvider: Parsing file:/tmp/nds_q86_fail_test > to re-build UI... 
> 23/05/01 21:57:17 ERROR ReplayListenerBus: Exception parsing Spark event log: > file:/tmp/nds_q86_fail_test > java.lang.NullPointerException > at > org.apache.spark.util.JsonProtocol$JsonNodeImplicits.extractElements(JsonProtocol.scala:1589) > at > org.apache.spark.util.JsonProtocol$.stackTraceFromJson(JsonProtocol.scala:1558) > at > org.apache.spark.util.JsonProtocol$.exceptionFromJson(JsonProtocol.scala:1569) > at > org.apache.spark.util.JsonProtocol$.jobResultFromJson(JsonProtocol.scala:1423) > at > org.apache.spark.util.JsonProtocol$.jobEndFromJson(JsonProtocol.scala:967) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:878) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:865) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:88) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:59) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$3(FsHistoryProvider.scala:1140) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$3$adapted(FsHistoryProvider.scala:1138) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2786) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$1(FsHistoryProvider.scala:1138) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$1$adapted(FsHistoryProvider.scala:1136) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.deploy.history.FsHistoryProvider.parseAppEventLogs(FsHistoryProvider.scala:1136) > at > org.apache.spark.deploy.history.FsHistoryProvider.rebuildAppStore(FsHistoryProvider.scala:1117) > at > org.apache.spark.deploy.history.FsHistoryProvider.createInMemoryStore(FsHistoryProvider.scala:1355) > at > org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:345) > at > 
org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:199) > at > org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163) > at > org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:134) > at > org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161) > at > org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:55) > at > org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:51) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) > at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:88) > at > org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:100) > at > o
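The failure mode in SPARK-43340 is a reader that assumes a field older writers never emitted. A minimal sketch of a tolerant reader follows, in plain Python rather than Spark's JsonProtocol; the function `exception_from_json` and the returned dictionary shape are hypothetical, and only the event JSON is taken from the report.

```python
import json


def exception_from_json(node: dict) -> dict:
    """Parse an exception node, treating a missing "Stack Trace" as empty."""
    frames = node.get("Stack Trace") or []  # absent in the older event log
    return {"message": node.get("Message"), "stack_trace": list(frames)}


# Event as written by the older version in the report (no "Stack Trace" field).
old_event = json.loads(
    '{"Event":"SparkListenerJobEnd","Job ID":31,'
    '"Completion Time":1616171909785,'
    '"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted"}}}'
)

parsed = exception_from_json(old_event["Job Result"]["Exception"])
print(parsed)
```

Defaulting the optional field lets the job end event replay as a failure instead of aborting the parse, which is what leaves the job listed as "Incomplete/Active" in the history server.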
[jira] [Resolved] (SPARK-41585) The Spark exclude node functionality for YARN should work independently of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-41585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-41585. --- Fix Version/s: 3.5.0 Target Version/s: 3.5.0 Assignee: Luca Canali Resolution: Fixed > The Spark exclude node functionality for YARN should work independently of > dynamic allocation > - > > Key: SPARK-41585 > URL: https://issues.apache.org/jira/browse/SPARK-41585 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.3, 3.1.3, 3.2.2, 3.3.1 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.5.0 > > > The Spark exclude node functionality for Spark on YARN, introduced in > SPARK-26688, allows users to specify a list of node names that are excluded > from resource allocation. This is done using the configuration parameter: > {{spark.yarn.exclude.nodes}} > The feature currently works only for executors allocated via dynamic > allocation. To use the feature on Spark 3.3.1, for example, one may set the > configurations {{{}spark.dynamicAllocation.enabled{}}}=true, > spark.dynamicAllocation.minExecutors=0 and spark.executor.instances=0, thus > making Spark spawning executors only via dynamic allocation. > This proposes to document this behavior for the current Spark release and > also proposes an improvement of this feature by extending the scope of Spark > exclude node functionality for YARN beyond dynamic allocation, which I > believe makes it more generally useful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
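The workaround quoted in SPARK-41585 for Spark 3.3.1 amounts to four settings. The sketch below assembles them as `--conf` arguments in Python; the helper `exclude_nodes_conf` and the host names are hypothetical, while the configuration keys and values are the ones named in the issue.

```python
def exclude_nodes_conf(nodes: list) -> list:
    """Build --conf arguments so executors come only from dynamic allocation."""
    conf = {
        "spark.yarn.exclude.nodes": ",".join(nodes),
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "0",
        "spark.executor.instances": "0",
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical node names, for illustration only.
submit_args = exclude_nodes_conf(["node-a.example.com", "node-b.example.com"])
print(" ".join(submit_args))
```

Per the issue text, the improvement resolved for 3.5.0 extends node exclusion beyond dynamic allocation, so the last three settings describe the workaround for earlier releases.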
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692408#comment-17692408 ]

Thomas Graves commented on SPARK-41793:
---------------------------------------

[~ulysses] [~cloud_fan] [~xinrong] We need to decide what we are doing with this for 3.4 before doing any release.

> Incorrect result for window frames defined by a range clause on large decimals
> -------------------------------------------------------------------------------
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gera Shegalov
> Priority: Blocker
> Labels: correctness
>
> Context: https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two non-empty windows as a result:
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT
>     COUNT(1) OVER (
>       PARTITION BY a
>       ORDER BY b ASC
>       RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
>     ) AS CNT_1
>   FROM
>     test_table
> ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> [Row(CNT_1=1), Row(CNT_1=1)]
> {code}
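The expected result in SPARK-41793 can be checked without Spark. The sketch below recomputes the `RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING` frame for the two `b` values from the report using Python's `decimal` module. The function name `range_frame_counts` is made up for illustration, and the code says nothing about where Spark's regression comes from; it only shows that each row's frame contains at least the row itself.

```python
from decimal import Decimal, localcontext


def range_frame_counts(values, preceding, following):
    """Count rows in [current - preceding, current + following] for each row."""
    with localcontext() as ctx:
        # The default precision (28 digits) would round the 37-digit value,
        # so raise it before doing any arithmetic.
        ctx.prec = 60
        ordered = sorted(values)
        counts = []
        for current in ordered:
            low = current - preceding
            high = current + following
            counts.append(sum(1 for v in ordered if low <= v <= high))
        return counts


rows = [Decimal("11342371013783243717493546650944543.47"), Decimal(".99")]
counts = range_frame_counts(rows, Decimal("10.2345"), Decimal("6.7890"))
print(counts)  # [1, 1]
```

Both counts are 1, which matches the Spark 3.3.1 output quoted in the issue.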
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688016#comment-17688016 ]

Thomas Graves commented on SPARK-39375:
---------------------------------------

So regarding UDFs, it's not clear to me how they are currently being implemented. If I have a Python UDF, does it require the server to be started like pyspark so the Python process is already present, or does it start Python on the side? It would be nice to have a design as [~xkrogen] mentioned.

> SPIP: Spark Connect - A client and server interface for Apache Spark
> ---------------------------------------------------------------------
>
> Key: SPARK-39375
> URL: https://issues.apache.org/jira/browse/SPARK-39375
> Project: Spark
> Issue Type: Epic
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Martin Grund
> Priority: Critical
> Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
> Below, we have just referenced the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of the limitations stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:
> (1) *Lack of built-in remote connectivity*: the Spark driver runs both the client application and scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL and users therefore rely on external solutions such as the inactive project [Apache Livy|https://livy.apache.org/].
> (2) *Lack of rich developer experience*: The current architecture and APIs do not cater for interactive data exploration (as done with Notebooks), or allow for building out rich developer experience common in modern code editors.
> (3) *Stability*: with the current shared driver architecture, users causing critical exceptions (e.g. OOM) bring the whole cluster down for all users.
> (4) *Upgradability*: the current entangling of platform and client APIs (e.g. first and third-party dependencies in the classpath) does not allow for seamless upgrades between Spark versions (and with that, hinders new feature adoption).
>
> We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it very easy to iteratively express complex logic. We will introduce _Spark Connect_, a remote option of the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark will become decoupled, allowing for built-in remote connectivity: The decoupled client SDK can be used to run interactive data exploration and connect to the server for DataFrame operations.
>
> Spark Connect will benefit Spark developers in different ways: The decoupled architecture will result in improved stability, as clients are separated from the driver. From the Spark Connect client perspective, Spark will be (almost) versionless, and thus enable seamless upgradability, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process will improve Spark’s overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.
>
> Spark Connect will strengthen Spark’s position as the modern unified engine for large-scale data analytics and expand applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable as the DataFrame API can be used with (almost) any programming language.
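To make the client/server split from the SPIP concrete, here is a toy sketch in plain Python. It is not the Spark Connect protocol (which uses gRPC and protobuf messages); the JSON plan encoding, the `ToyClient` and `toy_server_execute` names, and the in-memory `CATALOG` are all invented for illustration. The client only builds and serializes an unresolved plan, and the server resolves table and column names against its own catalog and executes the plan.

```python
import json


class ToyClient:
    """Builds an unresolved plan as plain data; never touches the data itself."""

    def __init__(self, table):
        self.plan = {"op": "read", "table": table}

    def filter(self, column, minimum):
        self.plan = {"op": "filter", "column": column, "min": minimum, "child": self.plan}
        return self

    def select(self, *columns):
        self.plan = {"op": "project", "columns": list(columns), "child": self.plan}
        return self

    def serialize(self):
        return json.dumps(self.plan)


# Hypothetical server-side catalog; the client has no access to it.
CATALOG = {"people": [{"name": "a", "age": 31}, {"name": "b", "age": 17}]}


def toy_server_execute(message):
    """Deserialize the plan, resolve names against CATALOG, and run it."""

    def run(plan):
        if plan["op"] == "read":
            return CATALOG[plan["table"]]
        rows = run(plan["child"])
        if plan["op"] == "filter":
            return [r for r in rows if r[plan["column"]] >= plan["min"]]
        if plan["op"] == "project":
            return [{c: r[c] for c in plan["columns"]} for r in rows]
        raise ValueError(f"unknown operator: {plan['op']}")

    return run(json.loads(message))


request = ToyClient("people").filter("age", 18).select("name").serialize()
result = toy_server_execute(request)
print(result)  # [{'name': 'a'}]
```

Because the request is just serialized data, the client could in principle be written in any language that can produce the same message, which is the point the SPIP makes about language reach.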
[jira] [Commented] (SPARK-42374) User-facing documentation
[ https://issues.apache.org/jira/browse/SPARK-42374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687978#comment-17687978 ]

Thomas Graves commented on SPARK-42374:
---------------------------------------

Just a note that we should make sure to document that there is no built-in authentication with this, unless that has changed since the design.

> User-facing documentation
> -------------------------
>
> Key: SPARK-42374
> URL: https://issues.apache.org/jira/browse/SPARK-42374
> Project: Spark
> Issue Type: Documentation
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Haejoon Lee
> Priority: Major
>
> We should provide user-facing documentation so that end users know how to use Spark Connect.
[jira] [Updated] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-41793:
----------------------------------
    Labels: correctness  (was: )

> Incorrect result for window frames defined by a range clause on large decimals
> -------------------------------------------------------------------------------
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gera Shegalov
> Priority: Major
> Labels: correctness
[jira] [Updated] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-41793:
----------------------------------
    Priority: Blocker  (was: Major)

> Incorrect result for window frames defined by a range clause on large decimals
> -------------------------------------------------------------------------------
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gera Shegalov
> Priority: Blocker
> Labels: correctness
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678744#comment-17678744 ]

Thomas Graves commented on SPARK-41793:
---------------------------------------

This sounds like a correctness issue - [~cloud_fan] [~ulyssesyou] am I missing something here?

> Incorrect result for window frames defined by a range clause on large decimals
> -------------------------------------------------------------------------------
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gera Shegalov
> Priority: Major
[jira] [Resolved] (SPARK-39601) AllocationFailure should not be treated as exitCausedByApp when driver is shutting down
[ https://issues.apache.org/jira/browse/SPARK-39601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-39601.
-----------------------------------
    Fix Version/s: 3.4.0
    Assignee: Cheng Pan
    Resolution: Fixed

> AllocationFailure should not be treated as exitCausedByApp when driver is shutting down
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-39601
> URL: https://issues.apache.org/jira/browse/SPARK-39601
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 3.3.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.4.0
>
> I observed some Spark applications that successfully completed all jobs but failed during the shutdown phase with the reason "Max number of executor failures (16) reached". The timeline is:
> Driver - The job succeeds and Spark starts the shutdown procedure.
> {code:java}
> 2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark@74e9431b{HTTP/1.1, (http/1.1)}
> {0.0.0.0:0}
> 2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.xxx.org:28446
> 2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors
> {code}
> Driver - A container is allocated successfully during the shutdown phase.
> {code:java}
> 2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_25 on host hadoop4388.xxx.org for executor with ID 24 for ResourceProfile Id 0{code}
> Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.
> {code:java} > Exception in thread "main" java.lang.reflect.UndeclaredThrowableException > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393) > at > org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81) > at > org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala) > Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413) > at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23) > at > scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) > at scala.collection.immutable.Range.foreach(Range.scala:158) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > ... 
4 more > Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find > endpoint: spark://coarsegrainedschedu...@hadoop2627.xxx.org:21956 > at > org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148) > at > org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144) > at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307) > at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) > at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138) > at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288){code} > Driver - YarnAllocator received container launch error message and treat it > as `exitCausedByApp` > {code:java} > 2022-06-23 19:52:27 CST YarnAllocator
[jira] [Updated] (SPARK-40524) local mode with resource scheduling can hang
[ https://issues.apache.org/jira/browse/SPARK-40524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-40524:
----------------------------------
    Summary: local mode with resource scheduling can hang  (was: local mode with resource scheduling should just fail)

> local mode with resource scheduling can hang
> --------------------------------------------
>
> Key: SPARK-40524
> URL: https://issues.apache.org/jira/browse/SPARK-40524
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.0
> Reporter: Thomas Graves
> Priority: Major
>
> If you try to run Spark in local mode and request custom resources like GPUs, Spark will hang. Resource scheduling isn't supported in local mode, so just removing the request for resources fixes the issue, but it's really confusing to users since it just hangs.
>
> To reproduce:
> spark-sql --conf spark.executor.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1
> Run:
> select 1
> Result: hangs
> To fix, run:
> spark-sql
>
> spark-sql> select 1;
> 1
> Time taken: 2.853 seconds, Fetched 1 row(s)
>
> It would be nice if we just failed to start or threw an exception when using those options in local mode.
[jira] [Created] (SPARK-40524) local mode with resource scheduling should just fail
Thomas Graves created SPARK-40524:
-------------------------------------

    Summary: local mode with resource scheduling should just fail
    Key: SPARK-40524
    URL: https://issues.apache.org/jira/browse/SPARK-40524
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 3.1.0
    Reporter: Thomas Graves

If you try to run Spark in local mode and request custom resources like GPUs, Spark will hang. Resource scheduling isn't supported in local mode, so just removing the request for resources fixes the issue, but it's really confusing to users since it just hangs.

To reproduce:
spark-sql --conf spark.executor.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1
Run:
select 1
Result: hangs
To fix, run:
spark-sql

spark-sql> select 1;
1
Time taken: 2.853 seconds, Fetched 1 row(s)

It would be nice if we just failed to start or threw an exception when using those options in local mode.
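The fail-fast behavior requested in SPARK-40524 could look roughly like the following. This is a standalone sketch, not Spark code: the function name `check_local_mode_resources` and both regular expressions are assumptions made for illustration, and only the two configuration keys come from the issue.

```python
import re

# Matches "local" and "local[...]" but not "local-cluster[...]".
LOCAL_MASTER = re.compile(r"^local(\[.*\])?$")

# Matches keys such as spark.executor.resource.gpu.amount.
RESOURCE_AMOUNT = re.compile(r"^spark\.(executor|task)\.resource\.[^.]+\.amount$")


def check_local_mode_resources(master, conf):
    """Raise instead of hanging when custom resources are requested in local mode."""
    if not LOCAL_MASTER.match(master):
        return
    requested = sorted(k for k in conf if RESOURCE_AMOUNT.match(k))
    if requested:
        raise ValueError(
            "Resource scheduling is not supported in local mode; remove: "
            + ", ".join(requested)
        )


conf = {
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "1",
}

try:
    check_local_mode_resources("local[*]", conf)
    outcome = "started"
except ValueError as err:
    outcome = "rejected"
    print(err)
```

With the same settings and a non-local master such as `yarn`, the check returns without raising, so only the confusing local-mode case is affected.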
[jira] [Resolved] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
[ https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-40490.
-----------------------------------
    Fix Version/s: 3.4.0
    Assignee: Yang Jie
    Resolution: Fixed

> `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-40490
> URL: https://issues.apache.org/jira/browse/SPARK-40490
> Project: Spark
> Issue Type: Improvement
> Components: Tests, YARN
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Fix For: 3.4.0
>
> After SPARK-17321, YarnShuffleService persists data to the local shuffle state db and reloads data from it only when the YARN NodeManager starts with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` does not set this config, and its default value is false, so `YarnShuffleIntegrationSuite` neither triggers data persistence to the db nor verifies the reload of data.
[jira] [Resolved] (SPARK-40280) Failure to create parquet predicate push down for ints and longs on some valid files
[ https://issues.apache.org/jira/browse/SPARK-40280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-40280.
-----------------------------------
    Fix Version/s: 3.4.0
                   3.3.1
                   3.2.3
    Assignee: Robert Joseph Evans
    Resolution: Fixed

> Failure to create parquet predicate push down for ints and longs on some valid files
> -------------------------------------------------------------------------------------
>
> Key: SPARK-40280
> URL: https://issues.apache.org/jira/browse/SPARK-40280
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: Robert Joseph Evans
> Assignee: Robert Joseph Evans
> Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> The [parquet format|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#signed-integers] specification states that...
> bq. {{INT(8, true)}}, {{INT(16, true)}}, and {{INT(32, true)}} must annotate an {{int32}} primitive type and {{INT(64, true)}} must annotate an {{int64}} primitive type. {{INT(32, true)}} and {{INT(64, true)}} are implied by the {{int32}} and {{int64}} primitive types if no other annotation is present and should be considered optional.
> But the code inside of [ParquetFilters.scala|https://github.com/apache/spark/blob/296fe49ec855ac8c15c080e7bab6d519fe504bd3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L125-L126] requires that {{int32}} and {{int64}} columns carry no annotation. If those columns do have an annotation and they are part of a predicate push down, the hard-coded types will not match and the corresponding filter ends up being {{None}}.
> This can be a huge performance penalty for a valid parquet file.
> I am happy to provide files that show the issue if needed for testing.
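The mismatch described in SPARK-40280 can be illustrated with a small standalone model. This is not the code in `ParquetFilters.scala`; the `ParquetColumnType` tuple, the `IMPLIED` table, and both matcher functions are hypothetical names used only to contrast "no annotation allowed" with "the implied annotation is also accepted".

```python
from typing import NamedTuple, Optional


class ParquetColumnType(NamedTuple):
    primitive: str                      # e.g. "int32" or "int64"
    annotation: Optional[tuple] = None  # e.g. ("INT", 32, True), or None if absent


# Annotations the spec describes as implied by the bare primitive type.
IMPLIED = {"int32": ("INT", 32, True), "int64": ("INT", 64, True)}


def strict_match(col, primitive):
    """Behavior described in the issue: any annotation prevents a match."""
    return col.primitive == primitive and col.annotation is None


def lenient_match(col, primitive):
    """Also accept the annotation that the primitive type already implies."""
    return col.primitive == primitive and col.annotation in (None, IMPLIED[primitive])


plain = ParquetColumnType("int32")
annotated = ParquetColumnType("int32", ("INT", 32, True))
narrow = ParquetColumnType("int32", ("INT", 8, True))

print(strict_match(plain, "int32"), strict_match(annotated, "int32"))
print(lenient_match(plain, "int32"), lenient_match(annotated, "int32"))
print(lenient_match(narrow, "int32"))
```

In this model the strict matcher rejects the explicitly annotated column, which corresponds to the filter becoming `None`, while the lenient matcher treats the annotated and unannotated forms the same and still keeps `INT(8, true)` distinct.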
[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`
[ https://issues.apache.org/jira/browse/SPARK-38888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579738#comment-17579738 ]

Thomas Graves commented on SPARK-38888:
---------------------------------------

Just curious: does RocksDB give us some particular benefit, such as performance or compatibility? Is LevelDB not supported on Apple silicon? It would be good to record why we are adding support.

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --------------------------------------------------
>
> Key: SPARK-38888
> URL: https://issues.apache.org/jira/browse/SPARK-38888
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, YARN
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and `YarnShuffleService`; a corresponding `RocksDB` implementation should be added.
[jira] [Updated] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-38910:
----------------------------------
    Fix Version/s: 3.4.0

> Clean sparkStaging dir should before unregister()
> -------------------------------------------------
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
> Issue Type: Task
> Components: YARN
> Affects Versions: 3.2.1, 3.3.0
> Reporter: angerszhu
> Priority: Minor
> Fix For: 3.4.0
>
> {code:java}
>   ShutdownHookManager.addShutdownHook(priority) { () =>
>     try {
>       val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf)
>       val isLastAttempt = appAttemptId.getAttemptId() >= maxAppAttempts
>       if (!finished) {
>         // The default state of ApplicationMaster is failed if it is invoked by shut down hook.
>         // This behavior is different compared to 1.x version.
>         // If user application is exited ahead of time by calling System.exit(N), here mark
>         // this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call
>         // System.exit(0) to terminate the application.
>         finish(finalStatus,
>           ApplicationMaster.EXIT_EARLY,
>           "Shutdown hook called before final status was reported.")
>       }
>       if (!unregistered) {
>         // we only want to unregister if we don't want the RM to retry
>         if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
>           unregister(finalStatus, finalMsg)
>           cleanupStagingDir(new Path(System.getenv("SPARK_YARN_STAGING_DIR")))
>         }
>       }
>     } catch {
>       case e: Throwable =>
>         logWarning("Ignoring Exception while stopping ApplicationMaster from shutdown hook", e)
>     }
>   }{code}
> unregister() may throw an exception, in which case cleanupStagingDir() is never reached, so the staging dir should be cleaned before unregister() is called.
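The ordering argument in SPARK-38910 can be shown with a small standalone sketch. It does not reproduce the ApplicationMaster code; `run_shutdown_hook` and the two callbacks are hypothetical stand-ins for the shutdown hook, `unregister()`, and `cleanupStagingDir()`, and the sketch only demonstrates that cleaning up first survives a failing unregister.

```python
def run_shutdown_hook(unregister, cleanup_staging_dir, warnings):
    """Clean the staging dir first so a failing unregister cannot skip it."""
    try:
        cleanup_staging_dir()
        unregister()
    except Exception as exc:
        # Mirrors the catch-and-log behavior of the hook quoted in the issue.
        warnings.append(f"Ignoring exception from shutdown hook: {exc}")


events = []
warnings = []


def failing_unregister():
    events.append("unregister")
    raise RuntimeError("RM unreachable")


def cleanup():
    events.append("cleanup")


run_shutdown_hook(failing_unregister, cleanup, warnings)
print(events)    # ['cleanup', 'unregister']
print(warnings)
```

With the calls in the opposite order, the exception raised by the unregister step would jump straight to the handler and the cleanup step would never run.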
[jira] [Resolved] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-38910.
-----------------------------------
    Resolution: Fixed

> Clean sparkStaging dir should before unregister()
> -------------------------------------------------
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
> Issue Type: Task
> Components: YARN
> Affects Versions: 3.2.1, 3.3.0
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Minor
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-38910) Clean sparkStaging dir should before unregister()
[ https://issues.apache.org/jira/browse/SPARK-38910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-38910:
-------------------------------------
    Assignee: angerszhu

> Clean sparkStaging dir should before unregister()
> -------------------------------------------------
>
> Key: SPARK-38910
> URL: https://issues.apache.org/jira/browse/SPARK-38910
> Project: Spark
> Issue Type: Task
> Components: YARN
> Affects Versions: 3.2.1, 3.3.0
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Minor
> Fix For: 3.4.0
[jira] [Updated] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-39976:
----------------------------------
    Labels: correctness  (was: )

> NULL check in ArrayIntersect adds extraneous null from first param
> -------------------------------------------------------------------
>
> Key: SPARK-39976
> URL: https://issues.apache.org/jira/browse/SPARK-39976
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Navin Kumar
> Priority: Major
> Labels: correctness
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a {{NULL}} value and the second one does not, an extraneous {{NULL}} is present in the output. This also leads to {{array_intersect(a, b) != array_intersect(b, a)}}, which is incorrect as set intersection should be commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> data = [(a, b)]
> >>> df = spark.sparkContext.parallelize(data).toDF(["a","b"])
> >>> df.show()
> +---------+------------+
> |        a|           b|
> +---------+------------+
> |[1, 2, 3]|[3, null, 5]|
> +---------+------------+
> >>> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |                  [3]|
> +---------------------+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |            [3, null]|
> +---------------------+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}}, and the final output is correct: {{[3]}}. In the second case, {{b}} does contain a {{NULL}} and is now the first parameter, so the output contains the extraneous {{NULL}}: {{[3, null]}}.
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.write.parquet("/tmp/simple.parquet")
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.show()
> +---------------+------------+
> |              a|           b|
> +---------------+------------+
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---------------+------------+
> scala> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |            [null, 4]|
> +---------------------+
> scala> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |                  [4]|
> +---------------------+
> {code}
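For reference, the null handling that SPARK-39976 expects can be written down in a few lines of plain Python. This is a sketch of the intended semantics only, not Spark's `ArrayIntersect` implementation; the function below is a hypothetical model in which `None` stands in for SQL `NULL` and is kept only when both inputs contain it.

```python
def array_intersect(left, right):
    """Distinct elements of `left` that also occur in `right`, in `left` order.

    None (standing in for SQL NULL) is emitted only if both arrays contain it.
    """
    right_has_null = any(v is None for v in right)
    right_values = {v for v in right if v is not None}
    seen = set()
    seen_null = False
    result = []
    for v in left:
        if v is None:
            if right_has_null and not seen_null:
                result.append(None)
                seen_null = True
        elif v in right_values and v not in seen:
            seen.add(v)
            result.append(v)
    return result


a = [1, 2, 3]
b = [3, None, 5]
print(array_intersect(a, b))  # [3]
print(array_intersect(b, a))  # [3]
```

Under this model the two argument orders from the PySpark example give the same elements, which is the commutativity the report says was lost.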
[jira] [Updated] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-39976: -- Labels: (was: corr)
[jira] [Updated] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-39976: -- Priority: Blocker (was: Major)
[jira] [Updated] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-39976: -- Labels: corr (was: )
[jira] [Commented] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575398#comment-17575398 ] Thomas Graves commented on SPARK-39976: --- [~cloud_fan] [~angerszhuuu], who worked on the original issue. This sounds like a correctness issue to me, so we should add the label if so.
[jira] [Created] (SPARK-39491) Hadoop 2.7 build fails due to org.apache.hadoop.yarn.api.records.NodeState.DECOMMISSIONING
Thomas Graves created SPARK-39491:
-------------------------------------

             Summary: Hadoop 2.7 build fails due to org.apache.hadoop.yarn.api.records.NodeState.DECOMMISSIONING
                 Key: SPARK-39491
                 URL: https://issues.apache.org/jira/browse/SPARK-39491
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.4.0
            Reporter: Thomas Graves

Building with the hadoop-2 profile, which uses Hadoop 2.7, fails with:

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:454: value DECOMMISSIONING is not a member of object org.apache.hadoop.yarn.api.records.NodeState

DECOMMISSIONING was only added in Hadoop 2.8. The code that references it was added in https://issues.apache.org/jira/browse/SPARK-30835

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39107) Silent change in regexp_replace's handling of empty strings
[ https://issues.apache.org/jira/browse/SPARK-39107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555111#comment-17555111 ] Thomas Graves commented on SPARK-39107: --- [~srowen] I think this actually went into 3.1.4, not 3.1.3, could you confirm before I update Fixed versions? > Silent change in regexp_replace's handling of empty strings > --- > > Key: SPARK-39107 > URL: https://issues.apache.org/jira/browse/SPARK-39107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Assignee: Lorenzo Martini >Priority: Major > Labels: correctness, release-notes > Fix For: 3.1.3, 3.3.0, 3.2.2 > > > Hi, we just upgraded from 3.0.2 to 3.1.2 and noticed a silent behavior change > that a) seems incorrect, and b) is undocumented in the [migration > guide|https://spark.apache.org/docs/latest/sql-migration-guide.html]: > {code:title=3.0.2} > scala> val df = spark.sql("SELECT '' AS col") > df: org.apache.spark.sql.DataFrame = [col: string] > scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", > "")).show > +---++ > |col|replaced| > +---++ > | | | > +---++ > {code} > {code:title=3.1.2} > scala> val df = spark.sql("SELECT '' AS col") > df: org.apache.spark.sql.DataFrame = [col: string] > scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", > "")).show > +---++ > |col|replaced| > +---++ > | || > +---++ > {code} > Note, the regular expression {{^$}} should match the empty string, but > doesn't in version 3.1. E.g. this is the Java behavior: > {code} > scala> "".replaceAll("^$", ""); > res1: String = > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
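The claim in the SPARK-39107 report that `^$` should match the empty string can be checked outside Spark. The sketch below uses Python's `re` module rather than the Java regex engine the report cites, so it only illustrates the expected matching rule. `PLACEHOLDER` and `replace_empty` are hypothetical names, and the marker string is an assumption chosen only so that a replacement is visible.

```python
import re

# Hypothetical marker string, chosen only so that a replacement is visible.
PLACEHOLDER = "EMPTY"


def replace_empty(text):
    """Apply the pattern ^$ to `text`, substituting PLACEHOLDER for any match."""
    return re.sub(r"^$", PLACEHOLDER, text)


print(repr(replace_empty("")))     # 'EMPTY'
print(repr(replace_empty("abc")))  # 'abc'
```

The empty input is replaced because both anchors hold at position 0, while `abc` passes through unchanged.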
[jira] [Updated] (SPARK-39434) Provide runtime error query context when array index is out of bound
[ https://issues.apache.org/jira/browse/SPARK-39434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-39434: -- Fix Version/s: 3.4.0 > Provide runtime error query context when array index is out of bound > > > Key: SPARK-39434 > URL: https://issues.apache.org/jira/browse/SPARK-39434 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
[ https://issues.apache.org/jira/browse/SPARK-39363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545479#comment-17545479 ] Thomas Graves commented on SPARK-39363: --- [~Kimahriman]
[jira] [Updated] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
[ https://issues.apache.org/jira/browse/SPARK-39363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-39363:
----------------------------------
    Description: 
see [https://github.com/apache/spark/pull/36744] for details.

It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it has a default value, which leads to it printing warnings all the time.
{code:java}
22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code}
We should remove the default value if possible. It should only be used as a fallback, but we should be able to use the default from {{spark.driver.memoryOverheadFactor}}.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
Thomas Graves created SPARK-39363:
-------------------------------------

             Summary: fix spark.kubernetes.memoryOverheadFactor deprecation warning
                 Key: SPARK-39363
                 URL: https://issues.apache.org/jira/browse/SPARK-39363
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.4.0
            Reporter: Thomas Graves

see [https://github.com/apache/spark/pull/36744] for details.

It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it has a default value, which leads to it printing warnings all the time.
{code:java}
22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code}
We should remove the default value if possible. It should only be used as a fallback, but we should be able to use the default from {{spark.driver.memoryOverheadFactor}}.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
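The fallback behavior that SPARK-39363 asks for (consult the deprecated key only when it was explicitly set, otherwise use the new key's default) can be sketched as below. This is an illustrative model in plain Python, not Spark's `SparkConf` code: `resolve_overhead_factor` is a hypothetical helper, the configuration is assumed to be a dict of explicitly set keys, and `ASSUMED_DEFAULT` is an illustrative value rather than one read from Spark.

```python
import warnings

NEW_KEY = "spark.driver.memoryOverheadFactor"
DEPRECATED_KEY = "spark.kubernetes.memoryOverheadFactor"
ASSUMED_DEFAULT = 0.1  # illustrative value, not read from Spark


def resolve_overhead_factor(conf):
    """Return the overhead factor for a dict of explicitly set config keys."""
    if NEW_KEY in conf:
        return float(conf[NEW_KEY])
    if DEPRECATED_KEY in conf:
        # Warn only when the user actually set the deprecated key.
        warnings.warn(
            f"'{DEPRECATED_KEY}' is deprecated; use '{NEW_KEY}' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return float(conf[DEPRECATED_KEY])
    # Neither key was set: fall back to the new key's default, silently.
    return ASSUMED_DEFAULT


print(resolve_overhead_factor({}))                # 0.1
print(resolve_overhead_factor({NEW_KEY: "0.2"}))  # 0.2
```

Because the deprecated key carries no default of its own in this model, an untouched configuration never triggers the warning.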
[jira] [Updated] (SPARK-38955) from_csv can corrupt surrounding lines if a lineSep is in the data
[ https://issues.apache.org/jira/browse/SPARK-38955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-38955:
----------------------------------
    Labels:   (was: corr)

> from_csv can corrupt surrounding lines if a lineSep is in the data
> ------------------------------------------------------------------
>
>                 Key: SPARK-38955
>                 URL: https://issues.apache.org/jira/browse/SPARK-38955
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>
> I don't know how critical this is. I was doing some general testing to
> understand {{from_csv}} and found that if a {{lineSep}} happens to be in
> the input data, the next row appears to be corrupted.
> {{multiLine}} does not appear to fix it. Because this is data corruption I am
> inclined to mark this as CRITICAL or BLOCKER, but it is an odd corner case so
> I am not going to set it myself.
> {code}
> Seq[String]("1,\n2,3,4,5","6,7,8,9,10", "11,12,13,14,15",
> null).toDF.select(col("value"), from_csv(col("value"),
> StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
> Map[String,String]())).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |   1,\n2,3,4,5|      {1, null}|
> |    6,7,8,9,10|      {null, 8}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}
> {code}
> Seq[String]("1,:2,3,4,5","6,7,8,9,10", "11,12,13,14,15",
> null).toDF.select(col("value"), from_csv(col("value"),
> StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
> Map[String,String]("lineSep" -> ":"))).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |    1,:2,3,4,5|      {1, null}|
> |    6,7,8,9,10|      {null, 8}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}
> {code}
> Seq[String]("1,\n2,3,4,5","6,7,8,9,10", "11,12,13,14,15",
> null).toDF.select(col("value"), from_csv(col("value"),
> StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
> Map[String,String]("lineSep" -> ":"))).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |   1,\n2,3,4,5|       {1, \n2}|
> |    6,7,8,9,10|         {6, 7}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38955) from_csv can corrupt surrounding lines if a lineSep is in the data
[ https://issues.apache.org/jira/browse/SPARK-38955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38955: -- Priority: Blocker (was: Major)
[jira] [Updated] (SPARK-38955) from_csv can corrupt surrounding lines if a lineSep is in the data
[ https://issues.apache.org/jira/browse/SPARK-38955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38955: -- Labels: corr (was: )
[jira] [Commented] (SPARK-38955) from_csv can corrupt surrounding lines if a lineSep is in the data
[ https://issues.apache.org/jira/browse/SPARK-38955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524985#comment-17524985 ] Thomas Graves commented on SPARK-38955: --- The from_csv docs point to the data source options, which include lineSep, so it seems like we should update the documentation and then, like you said, not permit it to be specified. Since it's a corruption, it seems bad, so I am marking it as Blocker to at least get more visibility and input.
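Why a line separator inside a value is a problem for SPARK-38955 can be seen with a line-oriented split. The sketch below is plain Python and does not model Spark's CSV parser: `split_records` and `is_single_record` are hypothetical helpers, and the second one is only an assumed pre-check a caller might run before handing values to `from_csv`.

```python
def split_records(value, line_sep="\n"):
    """Split raw text into records the way a line-oriented reader would."""
    return value.split(line_sep)


def is_single_record(value, line_sep="\n"):
    """True when `value` is None or holds no embedded line separator."""
    return value is None or line_sep not in value


rows = ["1,\n2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", None]
print(split_records(rows[0]))                        # ['1,', '2,3,4,5']
print([is_single_record(row) for row in rows])       # [False, True, True, True]
print([is_single_record(row, ":") for row in rows])  # [True, True, True, True]
```

The first value splits into two records under the default separator, which matches the report's observation that the leftover text affects the following row. With `:` as the separator, none of these inputs is split.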
[jira] [Updated] (SPARK-38677) pyspark hangs in local mode running rdd map operation
[ https://issues.apache.org/jira/browse/SPARK-38677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-38677:
----------------------------------
    Description: 
In Spark 3.2.1 (Spark 3.2.0 doesn't show this issue), pyspark hangs when running an RDD map operation and converting the result to a DataFrame. Code to reproduce is below.

Env:
Spark 3.2.1 in local mode, just run {{./bin/pyspark --driver-memory G --driver-cores }}

Download the dataset from [https://rapidsai-data.s3.us-east-2.amazonaws.com/spark/mortgage.zip]

Just 20 rows can reproduce the issue:
{{head -n 20 mortgage_eval_merged.csv > mortgage_eval_merged-small.csv}}
but if the input dataset is small, such as 5 rows, it works well.

Run the code below:
{code:python}
path = "//mortgage_eval_merged-small.csv"
# Split each CSV line into a list of string fields.
src_data = sc.textFile(path).map(lambda x: x.split(","))
# Column names 'c1' through 'c28'.
column_list = [f"c{i}" for i in range(1, 29)]
df = spark.createDataFrame(src_data, column_list)
df.show(1){code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38677) pyspark hangs in local mode running rdd map operation
[ https://issues.apache.org/jira/browse/SPARK-38677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513536#comment-17513536 ] Thomas Graves commented on SPARK-38677: --- Note: if you kill the python.daemon process while it's hung, control returns to the pyspark console with the right results. I looked through the commits in 3.2.1 and it appears that this was introduced by https://issues.apache.org/jira/browse/SPARK-33277 Specifically commit [https://github.com/apache/spark/commit/243c321db2f02f6b4d926114bd37a6e74c2be185] At least, when I reverted that commit and rebuilt, it worked. Also, this did not reproduce in standalone mode, so it might just be a local mode issue. [~ueshin] [~ankurdave] [~hyukjin.kwon] > pyspark hangs in local mode running rdd map operation > - > > Key: SPARK-38677 > URL: https://issues.apache.org/jira/browse/SPARK-38677 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Thomas Graves >Priority: Blocker > > In spark 3.2.1 (spark 3.2.0 doesn't show this issue), pyspark will hang when > running and RDD map operations and converting to a dataframe. Code is below > to reproduce. 
> Env: > spark 3.2.1 local mode, just run {{./bin/pyspark --driver-memory G > --driver-cores }} > {{download dataset from here > [https://rapidsai-data.s3.us-east-2.amazonaws.com/spark/mortgage.zip]}} > {{just 20 rows could reproduce the issue }}{{head -n 20 > mortgage_eval_merged.csv > mortgage_eval_merged-small.csv}}{{{} but if the > input dataset is small, such 5 rows, it works well.{}}}{{{}{}}}run codes > below: > {code:java} > path = "//mortgage_eval_merged-small.csv" src_data = > sc.textFile(path).map(lambda x:x.split(",")) column_list = > ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16','c17','c18','c19','c20','c21','c22','c23','c24','c25','c26','c27','c28'] > df = spark.createDataFrame(src_data,column_list) print(df.show(1)){code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38677) pyspark hangs in local mode running rdd map operation
[ https://issues.apache.org/jira/browse/SPARK-38677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38677: -- Affects Version/s: 3.3.0 > pyspark hangs in local mode running rdd map operation > - > > Key: SPARK-38677 > URL: https://issues.apache.org/jira/browse/SPARK-38677 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Thomas Graves >Priority: Blocker > > In spark 3.2.1 (spark 3.2.0 doesn't show this issue), pyspark will hang when > running and RDD map operations and converting to a dataframe. Code is below > to reproduce. > Env: > spark 3.2.1 local mode, just run {{./bin/pyspark --driver-memory G > --driver-cores }} > {{download dataset from here > [https://rapidsai-data.s3.us-east-2.amazonaws.com/spark/mortgage.zip]}} > {{just 20 rows could reproduce the issue }}{{head -n 20 > mortgage_eval_merged.csv > mortgage_eval_merged-small.csv}}{{{} but if the > input dataset is small, such 5 rows, it works well.{}}}{{{}{}}}run codes > below: > {code:java} > path = "//mortgage_eval_merged-small.csv" src_data = > sc.textFile(path).map(lambda x:x.split(",")) column_list = > ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16','c17','c18','c19','c20','c21','c22','c23','c24','c25','c26','c27','c28'] > df = spark.createDataFrame(src_data,column_list) print(df.show(1)){code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38677) pyspark hangs in local mode running rdd map operation
Thomas Graves created SPARK-38677:
-

Summary: pyspark hangs in local mode running rdd map operation
Key: SPARK-38677
URL: https://issues.apache.org/jira/browse/SPARK-38677
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.2.1
Reporter: Thomas Graves

In Spark 3.2.1 (Spark 3.2.0 doesn't show this issue), pyspark hangs when running an RDD map operation and converting the result to a DataFrame. Code to reproduce is below.

Env: Spark 3.2.1 in local mode. Just run {{./bin/pyspark --driver-memory G --driver-cores }}

Download the dataset from [https://rapidsai-data.s3.us-east-2.amazonaws.com/spark/mortgage.zip]

Just 20 rows are enough to reproduce the issue:
{{head -n 20 mortgage_eval_merged.csv > mortgage_eval_merged-small.csv}}
If the input dataset is smaller, such as 5 rows, it works fine.

Run the code below:
{code:python}
# Path as given in the report (directory portion elided in the original).
path = "//mortgage_eval_merged-small.csv"
# Split each CSV line into a list of string fields.
src_data = sc.textFile(path).map(lambda x:x.split(","))
# 28 column names, c1 through c28.
column_list = ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16','c17','c18','c19','c20','c21','c22','c23','c24','c25','c26','c27','c28']
df = spark.createDataFrame(src_data,column_list)
# The hang is observed at this step.
print(df.show(1))
{code}
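For anyone reproducing this without a shell handy, the {{head -n 20}} truncation step in the description can be sketched in plain Python. This is only a convenience sketch: the helper name {{head}} and its parameters are hypothetical, and the file names are the ones from the report.

```python
import itertools


def head(src_path: str, dst_path: str, n: int = 20) -> int:
    """Copy the first n lines of src_path to dst_path and return the count written.

    Hypothetical helper, roughly equivalent to `head -n N src > dst`.
    """
    written = 0
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", newline="") as src, open(dst_path, "w", newline="") as dst:
        for line in itertools.islice(src, n):
            dst.write(line)
            written += 1
    return written


# Example call, using the file names from the report:
# head("mortgage_eval_merged.csv", "mortgage_eval_merged-small.csv", 20)
```

Varying {{n}} (for example 5 versus 20) is a quick way to check the size threshold mentioned in the report.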
[jira] [Resolved] (SPARK-37618) Support cleaning up shuffle blocks from external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-37618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-37618. --- Fix Version/s: 3.3.0 Assignee: Adam Binford Resolution: Fixed > Support cleaning up shuffle blocks from external shuffle service > > > Key: SPARK-37618 > URL: https://issues.apache.org/jira/browse/SPARK-37618 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0 > > > Currently shuffle data is not cleaned up when an external shuffle service is > used and the associated executor has been deallocated before the shuffle is > cleaned up. Shuffle data is only cleaned up once the application ends. > There have been various issues filed for this: > https://issues.apache.org/jira/browse/SPARK-26020 > https://issues.apache.org/jira/browse/SPARK-17233 > https://issues.apache.org/jira/browse/SPARK-4236 > But shuffle files will still stick around until an application completes. > Dynamic allocation is commonly used for long running jobs (such as structured > streaming), so any long running jobs with a large shuffle involved will > eventually fill up local disk space. The shuffle service already supports > cleaning up shuffle service persisted RDDs, so it should be able to support > cleaning up shuffle blocks as well once the shuffle is removed by the > ContextCleaner. > The current alternative is to use shuffle tracking instead of an external > shuffle service, but this is less optimal from a resource perspective as all > executors must be kept alive until the shuffle has been fully consumed and > cleaned up (and with the default GC interval being 30 minutes this can waste > a lot of time with executors held onto but not doing anything). 
[jira] [Updated] (SPARK-38194) Make memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38194: -- Fix Version/s: 3.3.0 > Make memory overhead factor configurable > > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Mesos, YARN >Affects Versions: 3.4.0 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38194: -- Fix Version/s: 3.3.0 (was: 3.4.0) > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38194: -- Fix Version/s: 3.4.0 (was: 3.3.0) > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.4.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-38194. --- Fix Version/s: 3.3.0 Assignee: Adam Binford Resolution: Fixed > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
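To make the arithmetic in the description concrete, here is a rough sketch of how the default overhead depends on the factor. It assumes the overhead is the larger of factor times container memory and a 384 MiB floor, which is my understanding of Spark's documented default; the function name and the MiB units are illustrative, not Spark's API.

```python
# Assumed minimum overhead in MiB (the floor described in Spark's configuration docs).
MIN_OVERHEAD_MIB = 384


def memory_overhead_mib(container_memory_mib: int, factor: float = 0.10) -> int:
    """Return the default memory overhead for a driver or executor, in MiB.

    Hypothetical helper: overhead = max(memory * factor, floor).
    """
    return max(int(container_memory_mib * factor), MIN_OVERHEAD_MIB)


print(memory_overhead_mib(8192))        # hard-coded 10% of an 8 GiB executor
print(memory_overhead_mib(8192, 0.25))  # same executor with a configurable 25% factor
print(memory_overhead_mib(2048))        # small container: the floor applies instead
```

With a configurable factor the overhead scales with whatever driver/executor memory is chosen, so it no longer has to be pre-calculated per job. As far as I know, the settings added for this in 3.3.0 are {{spark.driver.memoryOverheadFactor}} and {{spark.executor.memoryOverheadFactor}}.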
[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes
[ https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503229#comment-17503229 ] Thomas Graves commented on SPARK-38379: --- So the issue here is a race between when Kubernetes calls MountVolumesFeatureStep (via adding it to the ExecutorPodsLifecycleManager, which calls addSubscriber in ExecutorPodsSnapshotsStoreImpl) and when spark.app.id is actually set in the Spark context. Here spark.app.id isn't set in the Spark context until after the scheduler backend has started. If it's not set, the only way to get the appId is to use the one generated in KubernetesClusterSchedulerBackend, since that is what is ultimately used in the Spark context to set spark.app.id. I'll investigate a fix. > Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes > -- > > Key: SPARK-38379 > URL: https://issues.apache.org/jira/browse/SPARK-38379 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Thomas Graves >Priority: Major > > I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in > client mode. I'm using persistent local volumes to mount nvme under /data in > the executors and on startup the driver always throws the warning below. > using these options: > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false > > > {code:java} > 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. 
> java.util.NoSuchElementException: spark.app.id > at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.SparkConf.get(SparkConf.scala:245) > at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391) > at 
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.c
[jira] [Commented] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC
[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503221#comment-17503221 ] Thomas Graves commented on SPARK-34960: --- If I'm reading the ORC spec right, the ColumnStatistics in the footer are optional in ORC, correct? I'm assuming that is why the PR says "If the file does not have valid statistics, Spark will throw exception and fail query." I guess the only way to know whether they are there is to read the file, so we can't really determine this ahead of time? This seems like behavior that should be documented at the very least. I want to make sure I'm not missing something here. > Aggregate (Min/Max/Count) push down for ORC > --- > > Key: SPARK-34960 > URL: https://issues.apache.org/jira/browse/SPARK-34960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we > can also push down certain aggregations into ORC. ORC exposes column > statistics in interface `org.apache.orc.Reader` > ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118] > ), where Spark can utilize for aggregation push down.
[jira] [Updated] (SPARK-36645) Aggregate (Min/Max/Count) push down for Parquet
[ https://issues.apache.org/jira/browse/SPARK-36645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-36645: -- Summary: Aggregate (Min/Max/Count) push down for Parquet (was: Aggregate (Count) push down for Parquet) > Aggregate (Min/Max/Count) push down for Parquet > --- > > Key: SPARK-36645 > URL: https://issues.apache.org/jira/browse/SPARK-36645 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > > Push down Aggregate (Min/Max/Count) for Parquet for performance improvement -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36645) Aggregate (Min/Max/Count) push down for Parquet
[ https://issues.apache.org/jira/browse/SPARK-36645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503212#comment-17503212 ] Thomas Graves edited comment on SPARK-36645 at 3/8/22, 10:52 PM: - Note it appears this only really pushes down count and min and max for some types because: Parquet Binary min/max could be truncated. We may get wrong result if we rely on parquet Binary min/max. I'm going to update the title to reflect this. was (Author: tgraves): Note it appears this only really pushes down count because: Parquet Binary min/max could be truncated. We may get wrong result if we rely on parquet Binary min/max. I'm going to update the title to reflect this. > Aggregate (Min/Max/Count) push down for Parquet > --- > > Key: SPARK-36645 > URL: https://issues.apache.org/jira/browse/SPARK-36645 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > > Push down Aggregate (Min/Max/Count) for Parquet for performance improvement -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
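A small illustration of why truncated binary statistics cannot be used to answer MIN/MAX directly. This is a sketch only: the truncation scheme below (prefix for the lower bound, prefix with the last byte bumped for the upper bound) is an assumption about how a writer might shorten statistics, not Parquet's actual code, and both function names are hypothetical.

```python
def truncate_min(value: bytes, length: int) -> bytes:
    # A prefix of the true minimum sorts at or before it, so it is a valid lower bound.
    return value[:length]


def truncate_max(value: bytes, length: int) -> bytes:
    # Cut to `length` bytes, then increment the last byte that is not 0xFF so the
    # result sorts after the original value and remains a valid upper bound.
    if len(value) <= length:
        return value
    prefix = bytearray(value[:length])
    for i in reversed(range(len(prefix))):
        if prefix[i] != 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
    return value  # every byte is 0xFF; keep the full value rather than shorten it


values = [b"apple", b"banana", b"cherry pie"]
true_min, true_max = min(values), max(values)
stat_min = truncate_min(true_min, 3)
stat_max = truncate_max(true_max, 3)

print(true_min, true_max)  # the answers a MIN/MAX query must return
print(stat_min, stat_max)  # what truncated statistics would report
```

The truncated statistics still bound the data, which is all that filter pruning needs, but neither bound is a value that actually occurs in the column, so returning them as the MIN/MAX result would be wrong.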
[jira] [Updated] (SPARK-36645) Aggregate (Count) push down for Parquet
[ https://issues.apache.org/jira/browse/SPARK-36645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-36645: -- Summary: Aggregate (Count) push down for Parquet (was: Aggregate (Min/Max/Count) push down for Parquet) > Aggregate (Count) push down for Parquet > --- > > Key: SPARK-36645 > URL: https://issues.apache.org/jira/browse/SPARK-36645 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > > Push down Aggregate (Min/Max/Count) for Parquet for performance improvement -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36645) Aggregate (Min/Max/Count) push down for Parquet
[ https://issues.apache.org/jira/browse/SPARK-36645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503212#comment-17503212 ] Thomas Graves commented on SPARK-36645: --- Note it appears this only really pushes down count because: Parquet Binary min/max could be truncated. We may get wrong result if we rely on parquet Binary min/max. I'm going to update the title to reflect this. > Aggregate (Min/Max/Count) push down for Parquet > --- > > Key: SPARK-36645 > URL: https://issues.apache.org/jira/browse/SPARK-36645 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > > Push down Aggregate (Min/Max/Count) for Parquet for performance improvement -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (TEZ-3362) Delete intermediate data at DAG level for Shuffle Handler
[ https://issues.apache.org/jira/browse/TEZ-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503138#comment-17503138 ] Thomas Graves commented on TEZ-3362: [~jeagles] [~kshukla] I know it's been a while. I was looking at this feature and I'm wondering how it works on a secure YARN setup. Generally the files and directories are read/write by the user and read-only by the Hadoop group. If this runs in the auxiliary shuffle handler in the node manager, it wouldn't have permission to remove the directories. Is this somehow relying on other permission or configuration changes, or is it running the remove as the user and I'm not seeing it? > Delete intermediate data at DAG level for Shuffle Handler > - > > Key: TEZ-3362 > URL: https://issues.apache.org/jira/browse/TEZ-3362 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Jonathan Turner Eagles >Assignee: Kuhu Shukla >Priority: Major > Fix For: 0.9.0 > > Attachments: TEZ-3362.001.patch, TEZ-3362.002.patch, > TEZ-3362.003.patch, TEZ-3362.004.patch, TEZ-3362.005.patch, > TEZ-3362.006.patch, TEZ-3362.007.patch, TEZ-3362.008.patch > > > Applications like hive that use tez in session mode need the ability to > delete intermediate data after a DAG completes and while the application > continues to run.
[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes
[ https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502555#comment-17502555 ] Thomas Graves commented on SPARK-38379: --- So I actually created another pod with the Spark client in it and used the spark-shell. [https://spark.apache.org/docs/3.2.1/running-on-kubernetes.html#client-mode] The only thing I had to do was make sure the ports were available. Since you don't run in this mode, I can investigate more. > Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes > -- > > Key: SPARK-38379 > URL: https://issues.apache.org/jira/browse/SPARK-38379 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Thomas Graves >Priority: Major > > I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in > client mode. I'm using persistent local volumes to mount nvme under /data in > the executors and on startup the driver always throws the warning below. > using these options: > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false > > > {code:java} > 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. 
> java.util.NoSuchElementException: spark.app.id > at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.SparkConf.get(SparkConf.scala:245) > at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391) > at 
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339) > at
[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes
[ https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500354#comment-17500354 ] Thomas Graves commented on SPARK-38379: --- just going by the stack trace this looks related to change https://issues.apache.org/jira/browse/SPARK-35182 [~dongjoon] Just curious if you have run into this? > Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes > -- > > Key: SPARK-38379 > URL: https://issues.apache.org/jira/browse/SPARK-38379 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Thomas Graves >Priority: Major > > I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in > client mode. I'm using persistent local volumes to mount nvme under /data in > the executors and on startup the driver always throws the warning below. > using these options: > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false > > > {code:java} > 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. 
> java.util.NoSuchElementException: spark.app.id > at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.SparkConf.get(SparkConf.scala:245) > at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391) > at 
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117) >
[jira] [Created] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes
Thomas Graves created SPARK-38379: - Summary: Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes Key: SPARK-38379 URL: https://issues.apache.org/jira/browse/SPARK-38379 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.2.1 Reporter: Thomas Graves I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in client mode. I'm using persistent local volumes to mount nvme under /data in the executors and on startup the driver always throws the warning below. using these options: --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false {code:java} 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber. 
java.util.NoSuchElementException: spark.app.id at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.SparkConf.get(SparkConf.scala:245) at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450) at org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57) at org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34) at org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64) at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) at scala.collection.immutable.List.foldLeft(List.scala:91) at org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:117) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:138) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:126) at org.apache.spark.scheduler.
[jira] [Commented] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451172#comment-17451172 ] Thomas Graves commented on SPARK-37461: --- [~angerszhuuu] please add a description to this issue. > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid
Thomas Graves created SPARK-37260:
-------------------------------------

Summary: PYSPARK Arrow 3.2.0 docs link invalid
Key: SPARK-37260
URL: https://issues.apache.org/jira/browse/SPARK-37260
Project: Spark
Issue Type: Bug
Components: Documentation
Affects Versions: 3.2.0
Reporter: Thomas Graves

[http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html] links to:

[https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]

which links to:

[https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]

But that is an invalid link. I assume it's supposed to point to:

https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
[jira] [Commented] (SPARK-37208) Support mapping Spark gpu/fpga resource types to custom YARN resource type
[ https://issues.apache.org/jira/browse/SPARK-37208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438784#comment-17438784 ]

Thomas Graves commented on SPARK-37208:
---------------------------------------

Note, I'm working on this.

> Support mapping Spark gpu/fpga resource types to custom YARN resource type
> --------------------------------------------------------------------------
>
> Key: SPARK-37208
> URL: https://issues.apache.org/jira/browse/SPARK-37208
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
>
> Currently Spark supports gpu/fpga resource scheduling, and on YARN specifically it knows how to map gpu/fpga to the YARN resource types yarn.io/gpu and yarn.io/fpga. YARN also supports custom resource types, and Hadoop 3.3.1 made it easier for users to plug in custom resource types. This means users may create a custom resource type that represents a GPU or FPGA because they want additional logic that the built-in YARN versions don't have. Ideally Spark users still just use the generic "gpu" or "fpga" types in Spark, so we should add the ability to change the Spark internal mappings.
[jira] [Created] (SPARK-37208) Support mapping Spark gpu/fpga resource types to custom YARN resource type
Thomas Graves created SPARK-37208:
-------------------------------------

Summary: Support mapping Spark gpu/fpga resource types to custom YARN resource type
Key: SPARK-37208
URL: https://issues.apache.org/jira/browse/SPARK-37208
Project: Spark
Issue Type: Improvement
Components: YARN
Affects Versions: 3.0.0
Reporter: Thomas Graves

Currently Spark supports gpu/fpga resource scheduling, and on YARN specifically it knows how to map gpu/fpga to the YARN resource types yarn.io/gpu and yarn.io/fpga. YARN also supports custom resource types, and Hadoop 3.3.1 made it easier for users to plug in custom resource types. This means users may create a custom resource type that represents a GPU or FPGA because they want additional logic that the built-in YARN versions don't have. Ideally Spark users still just use the generic "gpu" or "fpga" types in Spark, so we should add the ability to change the Spark internal mappings.
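The remapping requested in this issue can be illustrated with a small sketch. It is written in Python for brevity and is not Spark's implementation: the function `yarn_resource_name`, its `overrides` parameter, and the custom type `example.com/gpu` are hypothetical. Only the built-in names `yarn.io/gpu` and `yarn.io/fpga` come from the issue text.

```python
# Built-in Spark-to-YARN resource mapping named in the issue.
DEFAULT_YARN_RESOURCE_TYPES = {
    "gpu": "yarn.io/gpu",
    "fpga": "yarn.io/fpga",
}


def yarn_resource_name(spark_resource, overrides=None):
    """Return the YARN resource type for a generic Spark resource name.

    `overrides` is a hypothetical user-supplied mapping that takes
    precedence over the built-in defaults. Names with no mapping at all
    are passed through unchanged.
    """
    mapping = dict(DEFAULT_YARN_RESOURCE_TYPES)
    mapping.update(overrides or {})
    return mapping.get(spark_resource, spark_resource)


if __name__ == "__main__":
    # Default behaviour, then the same Spark name sent to a custom YARN type.
    print(yarn_resource_name("gpu"))
    print(yarn_resource_name("gpu", {"gpu": "example.com/gpu"}))
```

The point of the design is that applications keep asking for "gpu" or "fpga", and only the cluster-side translation changes.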
[jira] [Resolved] (SPARK-36540) AM should not just finish with Success when dissconnected
[ https://issues.apache.org/jira/browse/SPARK-36540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-36540.
-----------------------------------
Fix Version/s: 3.3.0
Assignee: angerszhu
Resolution: Fixed

> AM should not just finish with Success when dissconnected
> ---------------------------------------------------------
>
> Key: SPARK-36540
> URL: https://issues.apache.org/jira/browse/SPARK-36540
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, YARN
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Major
> Fix For: 3.3.0
>
> We hit a case where the AM loses its connection:
> {code}
> 21/08/18 02:14:15 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=5675952834716124039, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to xx.xx.xx.xx:41420; closing connection
> java.nio.channels.ClosedChannelException
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>   at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>   at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
>   at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>   at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>   at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Looking at the client-mode code, when the AMEndpoint is disconnected it finishes the application with a SUCCESS final status:
> {code}
> override def onDisconnected(remoteAddress: RpcAddress): Unit = {
>   // In cluster mode or unmanaged am case, do not rely on the disassociated event to exit
>   // This avoids potentially reporting incorrect exit codes if the driver fails
>   if (!(isClusterMode || sparkConf.get(YARN_UNMANAGED_AM))) {
>     logInfo(s"Driver terminated or disconnected! Shutting down. $remoteAddress")
>     finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>   }
> }
> {code}
> Normally in client mode, when the application succeeds, the driver stops and the AM loses its connection, so exiting with SUCCESS is fine. But if a network problem causes the disconnect, finishing with a SUCCESS final status is not correct.
> YarnClientSchedulerBackend then receives an application report with a SUCCESS final status and stops the SparkContext, so the application has failed but is marked as a normal stop.
> {code}
> private class MonitorThread extends Thread {
>   private var allowInterrupt = true
>
>   override def run() {
>     try {
>       val YarnAppReport(_, state, diags) =
>         client.monitorApplication(appId.get, logApplicationReport = false)
>       logError(s"YARN application has exited unexpectedly with state $state! " +
>         "Check the YARN application logs for more details.")
>       diags.foreach { err =>
>         logError(s"Diagnostics message: $err")
>       }
>       allowInterrupt = false
>       sc.stop()
>     } catch {
>       case e: InterruptedException => logInfo("Interrupting monitor thread")
>     }
>   }
>
>   def stopMonitor(): Unit = {
>     if (allowInterrupt) {
>       this.interrupt()
>     }
>   }
> }
> {code}
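The core of the report is that a disconnect alone cannot distinguish a clean driver shutdown from a network failure. The decision can be sketched as below, in Python for brevity. This is only an illustration under a stated assumption: that the AM has some way to know whether the driver announced a clean shutdown before the connection dropped. The flag `driver_reported_shutdown` and the function name are hypothetical, and this is not a description of the change that resolved the issue.

```python
from enum import Enum


class FinalStatus(Enum):
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


def status_on_disconnect(is_cluster_mode, unmanaged_am, driver_reported_shutdown):
    """Decide what the AM should report when the driver connection drops.

    Returns None when the disconnect event should be ignored (cluster mode
    or unmanaged AM, mirroring the guard in the quoted onDisconnected).
    Otherwise the result depends on `driver_reported_shutdown`, a
    hypothetical flag that is not part of the quoted code.
    """
    if is_cluster_mode or unmanaged_am:
        return None
    if driver_reported_shutdown:
        return FinalStatus.SUCCEEDED
    # A disconnect with no prior shutdown notice is treated as a failure.
    return FinalStatus.FAILED
```

With a rule like this, an unexplained disconnect in client mode is reported as a failure instead of silently counting as success.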
[jira] [Resolved] (SPARK-36624) When application killed, sc should not exit with code 0
[ https://issues.apache.org/jira/browse/SPARK-36624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-36624. --- Fix Version/s: 3.3.0 Assignee: angerszhu Resolution: Fixed > When application killed, sc should not exit with code 0 > --- > > Key: SPARK-36624 > URL: https://issues.apache.org/jira/browse/SPARK-36624 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > When application killed, sc should not exit with code 0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36817) Does Apache Spark 3 support GPU usage for Spark RDDs?
[ https://issues.apache.org/jira/browse/SPARK-36817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420737#comment-17420737 ] Thomas Graves commented on SPARK-36817: --- please refer to https://github.com/NVIDIA/spark-rapids/issues/35791 > Does Apache Spark 3 support GPU usage for Spark RDDs? > - > > Key: SPARK-36817 > URL: https://issues.apache.org/jira/browse/SPARK-36817 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Abhishek Shakya >Priority: Major > > I am currently trying to run genomic analyses pipelines using > [Hail|https://hail.is/](library for genomics analyses written in python and > Scala). Recently, Apache Spark 3 was released and it supported GPU usage. > I tried [spark-rapids|https://nvidia.github.io/spark-rapids/] library start > an on-premise slurm cluster with gpu nodes. I was able to initialise the > cluster. However, when I tried running hail tasks, the executors keep getting > killed. > On querying in Hail forum, I got the response that > {quote}That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any > Spark-SQL interfaces, only the RDD interfaces. > {quote} > So, does Spark3 not support GPU usage for RDD interfaces? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN
[ https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reopened SPARK-35672: --- > Spark fails to launch executors with very large user classpath lists on YARN > > > Key: SPARK-35672 > URL: https://issues.apache.org/jira/browse/SPARK-35672 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.1.2 > Environment: Linux RHEL7 > Spark 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > When running Spark on YARN, the {{user-class-path}} argument to > {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to > executor processes. The argument is specified once for each JAR, and the URIs > are fully-qualified, so the paths can be quite long. With large user JAR > lists (say 1000+), this can result in system-level argument length limits > being exceeded, typically manifesting as the error message: > {code} > /bin/bash: Argument list too long > {code} > A [Google > search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22] > indicates that this is not a theoretical problem and afflicts real users, > including ours. This issue was originally observed on Spark 2.3, but has been > confirmed to exist in the master branch as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN
[ https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419312#comment-17419312 ] Thomas Graves commented on SPARK-35672: --- Ok, sounds like we should revert then so this doesn't block 3.2 release > Spark fails to launch executors with very large user classpath lists on YARN > > > Key: SPARK-35672 > URL: https://issues.apache.org/jira/browse/SPARK-35672 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.1.2 > Environment: Linux RHEL7 > Spark 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > When running Spark on YARN, the {{user-class-path}} argument to > {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to > executor processes. The argument is specified once for each JAR, and the URIs > are fully-qualified, so the paths can be quite long. With large user JAR > lists (say 1000+), this can result in system-level argument length limits > being exceeded, typically manifesting as the error message: > {code} > /bin/bash: Argument list too long > {code} > A [Google > search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22] > indicates that this is not a theoretical problem and afflicts real users, > including ours. This issue was originally observed on Spark 2.3, but has been > confirmed to exist in the master branch as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
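To see why a long JAR list runs into argument length limits, the size of the repeated arguments can be estimated with a rough sketch like the one below (Python, for illustration only). The helper names, the `--` prefix on the flag, and the `limit_bytes` parameter are assumptions; the real limit depends on the operating system and also has to cover the environment, which this estimate ignores.

```python
def user_class_path_args(jar_uris):
    """Build one flag/URI pair per JAR, as the issue describes."""
    args = []
    for uri in jar_uris:
        args.extend(["--user-class-path", uri])
    return args


def command_line_bytes(args):
    """Approximate the bytes the argument strings occupy.

    Each argument is counted as its UTF-8 length plus one terminating
    NUL byte. Environment variables and pointer overhead are ignored.
    """
    return sum(len(arg.encode("utf-8")) + 1 for arg in args)


def exceeds_limit(jar_uris, limit_bytes):
    """Return True if the class path arguments alone exceed `limit_bytes`."""
    return command_line_bytes(user_class_path_args(jar_uris)) > limit_bytes


if __name__ == "__main__":
    # Hypothetical JAR locations; real fully-qualified URIs are often longer.
    uris = ["file:/opt/app/lib/dep-%04d.jar" % i for i in range(1000)]
    print(command_line_bytes(user_class_path_args(uris)))
```

Because the flag is repeated for every JAR, the total grows linearly with both the number of JARs and the length of each fully-qualified URI.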
[jira] [Updated] (SPARK-36772) FinalizeShuffleMerge fails with an exception due to attempt id not matching
[ https://issues.apache.org/jira/browse/SPARK-36772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-36772: -- Target Version/s: 3.2.0 > FinalizeShuffleMerge fails with an exception due to attempt id not matching > --- > > Key: SPARK-36772 > URL: https://issues.apache.org/jira/browse/SPARK-36772 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Mridul Muralidharan >Priority: Blocker > > As part of driver request to external shuffle services (ESS) to finalize the > merge, it also passes its [application attempt > id|https://github.com/apache/spark/blob/3f09093a21306b0fbcb132d4c9f285e56ac6b43c/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java#L180] > so that ESS can validate the request is from the correct attempt. > This attempt id is fetched from the TransportConf passed in when creating the > [ExternalBlockStoreClient|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkEnv.scala#L352] > - and the transport conf leverages a [cloned > copy|https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/network/netty/SparkTransportConf.scala#L47] > of the SparkConf passed to it. > Application attempt id is set as part of SparkContext > [initialization|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkContext.scala#L586]. > But this happens after driver SparkEnv has [already been > created|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkContext.scala#L460]. 
> Hence the attempt id that ExternalBlockStoreClient uses will always end up > being -1 : which will not match the attempt id at ESS (which is based on > spark.app.attempt.id) : resulting in merge finalization to always fail (" > java.lang.IllegalArgumentException: The attempt id -1 in this > FinalizeShuffleMerge message does not match with the current attempt id 1 > stored in shuffle service for application ...") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
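The ordering problem described in this issue (a cloned configuration is taken before the attempt id is set on the original) can be reproduced in miniature. The sketch below is in Python and uses a hypothetical `Conf` class as a stand-in; it is not Spark's SparkConf or TransportConf, and only the key name `spark.app.attempt.id` and the default of -1 come from the issue text.

```python
class Conf:
    """Tiny stand-in for a key/value configuration object (hypothetical)."""

    def __init__(self, settings=None):
        self._settings = dict(settings or {})

    def set(self, key, value):
        self._settings[key] = value

    def get_int(self, key, default):
        return int(self._settings.get(key, default))

    def clone(self):
        # The copy is independent: later changes to the original are not seen.
        return Conf(self._settings)


def demo():
    conf = Conf()
    # Step 1: the environment is created first and keeps a cloned copy.
    transport_conf = conf.clone()
    # Step 2: the attempt id is set on the original only afterwards.
    conf.set("spark.app.attempt.id", "1")
    # The clone never sees the update and falls back to the default of -1.
    return (
        transport_conf.get_int("spark.app.attempt.id", -1),
        conf.get_int("spark.app.attempt.id", -1),
    )
```

Here `demo()` yields -1 from the clone and 1 from the original, which is the mismatch the shuffle service rejects.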
[jira] [Updated] (SPARK-36666) [SQL] Regression in AQEShuffleReadExec
[ https://issues.apache.org/jira/browse/SPARK-36666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-36666:
----------------------------------
Priority: Blocker (was: Major)

> [SQL] Regression in AQEShuffleReadExec
> --------------------------------------
>
> Key: SPARK-36666
> URL: https://issues.apache.org/jira/browse/SPARK-36666
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Andy Grove
> Priority: Blocker
>
> I am currently testing the RAPIDS Accelerator for Apache Spark with the Spark 3.2 release candidate and there is a regression in AQEShuffleReadExec where it now throws an exception if the shuffle's output partitioning does not match a specific list of schemes.
> The problem can be solved by returning UnknownPartitioning, as it already does in some cases, rather than throwing an exception.
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408813#comment-17408813 ]

Thomas Graves commented on SPARK-36622:
---------------------------------------

Supporting _HOST for the SHS likely makes sense since it's a server.

> spark.history.kerberos.principal doesn't take value _HOST
> ---------------------------------------------------------
>
> Key: SPARK-36622
> URL: https://issues.apache.org/jira/browse/SPARK-36622
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Security, Spark Core
> Affects Versions: 3.0.1, 3.1.2
> Reporter: pralabhkumar
> Priority: Minor
>
> spark.history.kerberos.principal doesn't understand the value _HOST.
> It reports a login failure for the principal spark/_HOST@realm.
> It would be helpful to accept a _HOST value in the config file and replace it with the current hostname (similar to what Hive does). This would also help to run the SHS on multiple machines without hardcoding the principal hostname.
> .spark.history.kerberos.principal
>
> It requires a minor change in the initSecurity method of HistoryServer.scala.
>
> Please let me know if this request makes sense, and I'll create the PR.
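The substitution requested here is simple to sketch. The Python below is only an illustration of the idea, not the proposed change to HistoryServer.scala: the function `resolve_principal` and its `hostname` parameter are hypothetical, and lowercasing the hostname is an assumption rather than something stated in the issue.

```python
import socket


def resolve_principal(principal, hostname=None):
    """Replace the _HOST placeholder in a Kerberos principal.

    The principal is expected in the form primary/instance@REALM. Only an
    instance exactly equal to "_HOST" is replaced; any other principal is
    returned unchanged. When no hostname is given, the local fully
    qualified name is used.
    """
    primary, slash, rest = principal.partition("/")
    if not slash:
        return principal
    instance, at, realm = rest.partition("@")
    if instance != "_HOST":
        return principal
    # Lowercasing is an assumption made for this sketch.
    host = (hostname or socket.getfqdn()).lower()
    return f"{primary}/{host}{at}{realm}"


if __name__ == "__main__":
    print(resolve_principal("spark/_HOST@EXAMPLE.COM", "shs1.example.com"))
```

With this in place, the same configuration value could be shipped to every SHS machine and resolved locally at startup.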
[jira] [Commented] (SPARK-32333) Drop references to Master
[ https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405814#comment-17405814 ]

Thomas Graves commented on SPARK-32333:
---------------------------------------

I was looking to break this up into subtasks, but I'm not sure how much we will be able to. Perhaps something like this:

Note that we need to keep backwards compatibility, so for any public APIs/scripts a rename is really a copy. Also note that one thing we may not want to change, or should discuss more, is renaming ApplicationMaster, since that is the name Hadoop uses for it. We can certainly change internal APIs and functions, but externally we may not want to.

1) Rename the standalone Master and public APIs (SparkConf.setMaster), scripts, and docs that reference those scripts
2) Rename standalone Master classes that aren't public: MasterArguments, MasterUI, MasterMessages, etc.
3) Rename internal classes and messages with Master in the name. Note we could also rename classes with Master in their names, like BlockManagerMaster, MapOutputTrackerMaster, etc.

We could break this up further if people are interested in helping.

> Drop references to Master
> -------------------------
>
> Key: SPARK-32333
> URL: https://issues.apache.org/jira/browse/SPARK-32333
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
>
> We have a lot of references to "master" in the code base. It will be beneficial to remove references to problematic language that can alienate potential community members.
> SPARK-32004 removed references to slave
>
> Here is an IETF draft to fix up some of the most egregious examples (master/slave, whitelist/blacklist) with proposed alternatives.
> https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1
[jira] [Commented] (SPARK-36446) YARN shuffle server restart crashes all dynamic allocation jobs that have deallocated an executor
[ https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397461#comment-17397461 ]

Thomas Graves commented on SPARK-36446:
---------------------------------------

[~adamkennedy77] ^

> YARN shuffle server restart crashes all dynamic allocation jobs that have deallocated an executor
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-36446
> URL: https://issues.apache.org/jira/browse/SPARK-36446
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 2.4.8, 3.1.2
> Reporter: Adam Kennedy
> Priority: Critical
>
> When dynamic allocation is enabled, executors that deallocate rely on the shuffle server to hold blocks and supply them to remaining executors.
> When YARN Shuffle Server restarts (either intentionally or due to a crash), it loses block information and relies on being able to contact Executors (the locations of which it durably stores) to refetch the list of blocks.
> This mutual dependency on the other to hold block information fails fatally under some common scenarios.
> For example, if a Spark application is running under dynamic allocation, some number of executors will almost always shut down.
> If, after this has occurred, any shuffle server crashes, or is restarted (either directly when running as a standalone service, or as part of a YARN node manager restart) then there is no way to restore block data and it is permanently lost.
> Worse, when Executors try to fetch blocks from the shuffle server, the shuffle server cannot locate the executor, decides it doesn't exist, treats it as a fatal exception, and causes the application to terminate and crash.
> Thus, in a real world scenario that we observe on a 1000+ node multi-tenant cluster where dynamic allocation is on by default, a rolling restart of the YARN node managers will cause ALL jobs that have deallocated any executor and have shuffles or transferred blocks to the shuffle server in order to shut down, to crash.
[jira] [Commented] (SPARK-36446) YARN shuffle server restart crashes all dynamic allocation jobs that have deallocated an executor
[ https://issues.apache.org/jira/browse/SPARK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394931#comment-17394931 ]

Thomas Graves commented on SPARK-36446:
---------------------------------------

Is this with YARN NodeManager recovery enabled? That is, YARN stores the necessary information in a database, which it loads back up on restart. If that is not configured, it will not remember blocks.

> YARN shuffle server restart crashes all dynamic allocation jobs that have deallocated an executor
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-36446
> URL: https://issues.apache.org/jira/browse/SPARK-36446
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 2.4.8, 3.1.2
> Reporter: Adam Kennedy
> Priority: Critical
>
> When dynamic allocation is enabled, executors that deallocate rely on the shuffle server to hold blocks and supply them to remaining executors.
> When YARN Shuffle Server restarts (either intentionally or due to a crash), it loses block information and relies on being able to contact Executors (the locations of which it durably stores) to refetch the list of blocks.
> This mutual dependency on the other to hold block information fails fatally under some common scenarios.
> For example, if a Spark application is running under dynamic allocation, some number of executors will almost always shut down.
> If, after this has occurred, any shuffle server crashes, or is restarted (either directly when running as a standalone service, or as part of a YARN node manager restart) then there is no way to restore block data and it is permanently lost.
> Worse, when Executors try to fetch blocks from the shuffle server, the shuffle server cannot locate the executor, decides it doesn't exist, treats it as a fatal exception, and causes the application to terminate and crash.
> Thus, in a real world scenario that we observe on a 1000+ node multi-tenant cluster where dynamic allocation is on by default, a rolling restart of the YARN node managers will cause ALL jobs that have deallocated any executor and have shuffles or transferred blocks to the shuffle server in order to shut down, to crash.
[jira] [Assigned] (SPARK-595) Document "local-cluster" mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-595: --- Assignee: Yuto Akutsu > Document "local-cluster" mode > - > > Key: SPARK-595 > URL: https://issues.apache.org/jira/browse/SPARK-595 > Project: Spark > Issue Type: New Feature > Components: Documentation >Affects Versions: 0.6.0 >Reporter: Josh Rosen >Assignee: Yuto Akutsu >Priority: Minor > Fix For: 3.2.0, 3.3.0 > > > The 'Spark Standalone Mode' guide describes how to manually launch a > standalone cluster, which can be done locally for testing, but it does not > mention SparkContext's `local-cluster` option. > What are the differences between these approaches? Which one should I prefer > for local testing? Can I still use the standalone web interface if I use > 'local-cluster' mode? > It would be useful to document this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-595) Document "local-cluster" mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-595. - Fix Version/s: 3.3.0 3.2.0 Resolution: Fixed > Document "local-cluster" mode > - > > Key: SPARK-595 > URL: https://issues.apache.org/jira/browse/SPARK-595 > Project: Spark > Issue Type: New Feature > Components: Documentation >Affects Versions: 0.6.0 >Reporter: Josh Rosen >Priority: Minor > Fix For: 3.2.0, 3.3.0 > > > The 'Spark Standalone Mode' guide describes how to manually launch a > standalone cluster, which can be done locally for testing, but it does not > mention SparkContext's `local-cluster` option. > What are the differences between these approaches? Which one should I prefer > for local testing? Can I still use the standalone web interface if I use > 'local-cluster' mode? > It would be useful to document this.
[jira] [Reopened] (SPARK-595) Document "local-cluster" mode
[ https://issues.apache.org/jira/browse/SPARK-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reopened SPARK-595: - > Document "local-cluster" mode > - > > Key: SPARK-595 > URL: https://issues.apache.org/jira/browse/SPARK-595 > Project: Spark > Issue Type: New Feature > Components: Documentation >Affects Versions: 0.6.0 >Reporter: Josh Rosen >Priority: Minor > > The 'Spark Standalone Mode' guide describes how to manually launch a > standalone cluster, which can be done locally for testing, but it does not > mention SparkContext's `local-cluster` option. > What are the differences between these approaches? Which one should I prefer > for local testing? Can I still use the standalone web interface if I use > 'local-cluster' mode? > It would be useful to document this.