[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19649#discussion_r149315115
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/events.scala ---
@@ -110,7 +120,31 @@ case class RenameTableEvent(
   extends TableEvent
 /**
- * Event fired when a function is created, dropped or renamed.
+ * Enumeration to indicate which part of table is altered. If a plain alterTable API is called, then
+ * type will generally be Table.
+ */
+object AlterTableKind extends Enumeration {
+  val Table, DataSchema, Stats = Value
--- End diff --
I'm OK with using String, but I'd prefer a strong type to avoid nasty issues.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
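The strong-typing point can be sketched with a toy use of the `AlterTableKind` enumeration from the diff (the `AlterTableEvent` shape here is hypothetical, for illustration only): a misspelled kind fails at compile time, where a raw String would fail only at runtime.

```scala
// AlterTableKind as in the diff above; the event case class is a sketch.
object AlterTableKind extends Enumeration {
  val Table, DataSchema, Stats = Value
}

case class AlterTableEvent(db: String, table: String, kind: AlterTableKind.Value)

// Passing e.g. "dataScheema" by accident is impossible: only the three
// enumeration values typecheck as AlterTableKind.Value.
val event = AlterTableEvent("default", "t1", AlterTableKind.DataSchema)

val desc = event.kind match {
  case AlterTableKind.Table      => "table"
  case AlterTableKind.DataSchema => "dataSchema"
  case AlterTableKind.Stats      => "stats"
}
println(desc) // prints "dataSchema"
```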
[GitHub] spark pull request #19663: [SPARK-21888][YARN][SQL][Hive]add hadoop/hive/hba...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19663#discussion_r149017279
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -705,6 +705,19 @@ private[spark] class Client(
       }
     }
+    val confDir =
+      sys.env.getOrElse("SPARK_CONF_DIR", sys.env("SPARK_HOME") + File.separator + "conf")
+    val dir = new File(confDir)
+    if (dir.isDirectory) {
+      val files = dir.listFiles(new FileFilter {
+        override def accept(pathname: File): Boolean = {
+          pathname.isFile && pathname.getName.endsWith("xml")
--- End diff --
Yes, I understand. My question is: do we need to explicitly check for the expected file names, rather than blindly matching any XML file?
[GitHub] spark issue #19663: [SPARK-21888][SQL][Hive]add hadoop/hive/hbase/etc config...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19663 Please also add the [YARN] tag to the PR title; this is actually a YARN problem.
[GitHub] spark pull request #19663: [SPARK-21888][SQL][Hive]add hadoop/hive/hbase/etc...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19663#discussion_r149015858
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -705,6 +705,19 @@ private[spark] class Client(
       }
     }
+    val confDir =
+      sys.env.getOrElse("SPARK_CONF_DIR", sys.env("SPARK_HOME") + File.separator + "conf")
+    val dir = new File(confDir)
+    if (dir.isDirectory) {
+      val files = dir.listFiles(new FileFilter {
+        override def accept(pathname: File): Boolean = {
+          pathname.isFile && pathname.getName.endsWith("xml")
--- End diff --
Shall we explicitly match the file names, like "hive-site.xml"? It looks like only checking that the file name ends with "xml" will also include other unwanted files.
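A sketch of the explicit matching suggested here. The whitelist of file names is an assumption based on the hadoop/hive/hbase configs this PR targets, not the PR's actual list, and `listKnownConfFiles` is a hypothetical helper name:

```scala
import java.io.{File, FileFilter}

// Assumed whitelist of config files to distribute; only these names pass.
val expectedConfFiles =
  Set("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "hive-site.xml", "hbase-site.xml")

def listKnownConfFiles(dir: File): Array[File] = {
  if (dir.isDirectory) {
    dir.listFiles(new FileFilter {
      // Match exact known names instead of any "*.xml" in the conf dir.
      override def accept(pathname: File): Boolean =
        pathname.isFile && expectedConfFiles.contains(pathname.getName)
    })
  } else {
    Array.empty[File]
  }
}
```

With this, a stray `some-other-tool.xml` sitting in SPARK_CONF_DIR would no longer be picked up.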
[GitHub] spark issue #18791: [SPARK-21571][Scheduler] Spark history server leaves inc...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/18791 @ericvandenbergfb please also fix the PR title, thanks.
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19649#discussion_r149009631
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogEventSuite.scala ---
@@ -104,6 +109,8 @@ class ExternalCatalogEventSuite extends SparkFunSuite {
       tableType = CatalogTableType.MANAGED,
       storage = storage,
       schema = new StructType().add("id", "long"))
+    val tableDefWithSparkVersion =
--- End diff --
Sorry this was from my original code, will update it.
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19649#discussion_r149004933
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala ---
@@ -158,7 +173,13 @@ abstract class ExternalCatalog
    * @param table Name of table to alter schema for
    * @param newDataSchema Updated data schema to be used for the table.
    */
-  def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit
+  final def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit = {
+    postToAll(AlterTableSchemaPreEvent(db, table))
--- End diff --
I don't think it is necessary to carry the new schema in the event; we can query the catalog by `db` and `table` to get the newly set schema.
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19649#discussion_r149003803
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala ---
@@ -147,7 +154,15 @@ abstract class ExternalCatalog
    * Note: If the underlying implementation does not support altering a certain field,
    * this becomes a no-op.
    */
-  def alterTable(tableDefinition: CatalogTable): Unit
+  final def alterTable(tableDefinition: CatalogTable): Unit = {
--- End diff --
@cloud-fan, since we now expose the `alterTable` interface for other components to leverage, if we don't track this, it looks like we are missing a piece of the `ExternalCatalogEvent`s. I think for now we can add this `AlterTableEvent`; later on, if we remove this method, we can make the event a no-op (kept only for compatibility). What do you think?
@wzhfy, I was thinking about adding partition-related events, but I'm not sure why this whole piece is missing, or whether partition-related events are necessary. If we have an agreement on such events, I'm OK to add them.
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/19649 [SPARK-22405][SQL] Add more ExternalCatalogEvent
## What changes were proposed in this pull request?
We're building a data lineage tool in which we need to monitor the metadata changes in ExternalCatalog. The current ExternalCatalog already provides several useful events, like "CreateDatabaseEvent", for a custom SparkListener to use. But there are still some events missing, like an alter database event and an alter table event. So here I propose to add new ExternalCatalogEvents.
## How was this patch tested?
Enriched the current UT and tested on a local cluster.
CC @hvanhovell please let me know your comments about the current proposal, thanks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jerryshao/apache-spark SPARK-22405
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19649.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19649
commit 5c628be6b6838b224a27e06731f686a5182e1bad Author: jerryshao <ss...@hortonworks.com> Date: 2017-11-03T01:48:48Z Add more ExternalCatalogEvent
[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19586
I tend to agree with @cloud-fan. I think you can implement your own serializer outside of Spark, one more specialized for your application; that will definitely be more efficient than the built-in one. But Spark's default solution should be general enough to cover all cases, and setting a flag or a configuration is not intuitive enough, from my understanding.
And for ML, can you please provide an example of how this could be improved with your approach? From my understanding, your approach is more useful when leveraging custom class definitions, like `Person` in your example. But for ML/SQL cases, all the types should be predefined or primitive; will that improve things a lot?
[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19396 Sorry I didn't notice it, will double-check next time.
[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19396 OK, let me merge to master branch.
[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19586 Using configurations seems inelegant; also, configuration is application-based, so how would you turn this feature on/off at runtime? Sorry I cannot give you good advice; maybe Kryo's solution is the best option for the general case.
[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19396 I'm OK with the current changes.
[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19586
@ConeyLiu what about the below example, does your implementation support this?
```scala
trait Base { val name: String }
case class A(name: String) extends Base
case class B(name: String) extends Base

sc.parallelize(Seq(A("a"), B("b"))).map { i => (i, 1) }.reduceByKey(_ + _).collect()
```
Here not all the elements have the same class type; does your PR support such a scenario?
[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19580 Jenkins, retest this please.
[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19580 jenkins, retest this please.
[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19580#discussion_r147325260
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -267,6 +267,10 @@ private[spark] class ExecutorAllocationManager(
     (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
   }
+  private def totalRunningTasks(): Int = synchronized {
--- End diff --
I'm not sure why we need to add a method that is only used for unit tests. If you want to verify the behavior of `totalRunningTasks`, I think `maxNumExecutorsNeeded` can also be used indirectly for verification.
[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19580#discussion_r147304200
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -267,6 +267,10 @@ private[spark] class ExecutorAllocationManager(
     (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
   }
+  private def totalRunningTasks(): Int = synchronized {
--- End diff --
Looks like no one invokes this method?
[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19580#discussion_r147303973
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -678,7 +679,9 @@ private[spark] class ExecutorAllocationManager(
     val executorId = taskStart.taskInfo.executorId
     allocationManager.synchronized {
-      numRunningTasks += 1
+      if (stageIdToNumRunningTask.contains(stageId)) {
+        stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) + 1
--- End diff --
nit: this can be changed to `stageIdToNumRunningTask(stageId) += 1`
[GitHub] spark pull request #19580: [SPARK-11334][CORE] Fix bug in Executor allocatio...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19580#discussion_r147304306
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -709,7 +712,9 @@ private[spark] class ExecutorAllocationManager(
     val taskIndex = taskEnd.taskInfo.index
     val stageId = taskEnd.stageId
     allocationManager.synchronized {
-      numRunningTasks -= 1
+      if (stageIdToNumRunningTask.contains(stageId)) {
+        stageIdToNumRunningTask(stageId) = stageIdToNumRunningTask(stageId) - 1
--- End diff --
ditto.
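The `+=` sugar suggested in the nit is equivalent to the longer form: on a `mutable.Map` with numeric values, `m(k) += 1` desugars to `m.update(k, m(k) + 1)`. A minimal sketch (the map contents are made up):

```scala
import scala.collection.mutable

// Hypothetical state: stage 0 currently has 2 running tasks.
val stageIdToNumRunningTask = mutable.HashMap[Int, Int](0 -> 2)

if (stageIdToNumRunningTask.contains(0)) {
  // Same effect as: stageIdToNumRunningTask(0) = stageIdToNumRunningTask(0) + 1
  stageIdToNumRunningTask(0) += 1
}

println(stageIdToNumRunningTask(0)) // prints "3"
```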
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19519 LGTM, merging to master.
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19534 @sitalkedia would you please reopen this PR? I think the second issue I fixed before is not valid anymore, and for the first issue the fix is no different from the one here.
[GitHub] spark pull request #11205: [SPARK-11334][Core] Handle maximum task failure s...
Github user jerryshao closed the pull request at: https://github.com/apache/spark/pull/11205
[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/11205 Verified again; it looks like the 2nd bullet is not valid anymore. I cannot reproduce it on the latest master branch; this might have already been fixed in SPARK-13054. So only the first issue still exists, and I think @sitalkedia's PR is enough to handle it. I'm going to close this one. @sitalkedia would you please reopen your PR? Sorry to bring in noise.
[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/11205 @vanzin, in the current code `stageIdToTaskIndices` cannot be used to track the number of running tasks, because this structure doesn't remove a task index when the task finishes successfully. Yes, `isExecutorIdle` is used to take care of executor idleness, but the way it identifies whether an executor is idle is not robust enough. In this scenario, when a stage is aborted because of max task failures, some task end events will be missing, so using the number of tasks per executor will leave residual data and make the executor always appear busy.
[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19458 retest this please.
[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19458 There's a UT failure (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83014/testReport/junit/org.apache.spark.storage/BlockIdSuite/test_bad_deserialization/). @superbobry please fix this failure.
[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19519#discussion_r146737263
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.lang.reflect.Modifier
+
+import org.apache.spark.SparkConf
+
+/**
+ * Entry point for a Spark application. Implementations must provide a no-argument constructor.
+ */
+private[spark] trait SparkApplication {
+
+  def start(args: Array[String], conf: SparkConf): Unit
+
+}
+
+/**
+ * Implementation of SparkApplication that wraps a standard Java class with a "main" method.
+ *
+ * Configuration is propagated to the application via system properties, so running multiple
+ * of these in the same JVM may lead to undefined behavior due to configuration leaks.
+ */
+private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {
+
+  override def start(args: Array[String], conf: SparkConf): Unit = {
+    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
+    if (!Modifier.isStatic(mainMethod.getModifiers)) {
+      throw new IllegalStateException("The main method in the given main class must be static")
+    }
+
+    val sysProps = conf.getAll.toMap
+    sysProps.foreach { case (k, v) =>
+      sys.props(k) = v
+    }
+
+    mainMethod.invoke(null, args)
+  }
--- End diff --
I see, thanks for the explanation. I cannot figure out a solution that doesn't break the current semantics of `SparkConf`; this might be the only choice.
[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19519#discussion_r146734075
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.lang.reflect.Modifier
+
+import org.apache.spark.SparkConf
+
+/**
+ * Entry point for a Spark application. Implementations must provide a no-argument constructor.
+ */
+private[spark] trait SparkApplication {
+
+  def start(args: Array[String], conf: SparkConf): Unit
+
+}
+
+/**
+ * Implementation of SparkApplication that wraps a standard Java class with a "main" method.
+ *
+ * Configuration is propagated to the application via system properties, so running multiple
+ * of these in the same JVM may lead to undefined behavior due to configuration leaks.
+ */
+private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {
+
+  override def start(args: Array[String], conf: SparkConf): Unit = {
+    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
+    if (!Modifier.isStatic(mainMethod.getModifiers)) {
+      throw new IllegalStateException("The main method in the given main class must be static")
+    }
+
+    val sysProps = conf.getAll.toMap
+    sysProps.foreach { case (k, v) =>
+      sys.props(k) = v
+    }
+
+    mainMethod.invoke(null, args)
+  }
--- End diff --
But based on your comment "allow multiple applications to be started in the same JVM", will these system properties contaminate follow-up applications? Sorry if I misunderstood your scenario.
[GitHub] spark pull request #18492: [SPARK-19326] Speculated task attempts do not get...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18492#discussion_r146190420
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -373,8 +373,14 @@ private[spark] class ExecutorAllocationManager(
     // If our target has not changed, do not send a message
     // to the cluster manager and reset our exponential growth
     if (delta == 0) {
-      numExecutorsToAdd = 1
-      return 0
+      // Check if there is any speculative jobs pending
+      if (listener.pendingTasks == 0 && listener.pendingSpeculativeTasks > 0) {
+        numExecutorsTarget =
+          math.max(math.min(maxNumExecutorsNeeded + 1, maxNumExecutors), minNumExecutors)
--- End diff --
@janewangfb Would you please explain why there is a `+ 1` here when there are pending speculative tasks? Should the number of executors be calculated based on the number of pending tasks? Thanks!
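The clamping expression being questioned can be isolated as a small helper to see its behavior (`clampTarget` is a hypothetical name; the formula is exactly the one quoted in the diff): the `needed + 1` value is bounded above by `maxNumExecutors` and below by `minNumExecutors`.

```scala
// Sketch of the target calculation from the diff; names here are illustrative.
def clampTarget(maxNumExecutorsNeeded: Int, minNumExecutors: Int, maxNumExecutors: Int): Int =
  math.max(math.min(maxNumExecutorsNeeded + 1, maxNumExecutors), minNumExecutors)

// needed + 1 = 6 is capped at the upper bound of 4.
println(clampTarget(5, 1, 4))  // prints "4"
// needed + 1 = 1 is raised to the lower bound of 2.
println(clampTarget(0, 2, 10)) // prints "2"
```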
[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 @sjrand would you please close this PR? It is already merged to branch-2.2.
[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 Thanks, merging to branch 2.2.
[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 ok to test.
[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 Can you please add the `[BACKPORT-2.2]` tag to the PR title?
[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 ok to test.
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19534 @sitalkedia I'm OK with either.
[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19519#discussion_r146154530
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.lang.reflect.Modifier
+
+import org.apache.spark.SparkConf
+
+/**
+ * Entry point for a Spark application. Implementations must provide a no-argument constructor.
+ */
+private[spark] trait SparkApplication {
+
+  def start(args: Array[String], conf: SparkConf): Unit
+
+}
+
+/**
+ * Implementation of SparkApplication that wraps a standard Java class with a "main" method.
+ *
+ * Configuration is propagated to the application via system properties, so running multiple
+ * of these in the same JVM may lead to undefined behavior due to configuration leaks.
+ */
+private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {
+
+  override def start(args: Array[String], conf: SparkConf): Unit = {
+    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
+    if (!Modifier.isStatic(mainMethod.getModifiers)) {
+      throw new IllegalStateException("The main method in the given main class must be static")
+    }
+
+    val sysProps = conf.getAll.toMap
+    sysProps.foreach { case (k, v) =>
+      sys.props(k) = v
+    }
+
+    mainMethod.invoke(null, args)
+  }
--- End diff --
@vanzin, do we need to remove all the system properties after `mainMethod` is finished?
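One way the cleanup asked about here could look (a sketch, not the PR's code; `withProps` is a hypothetical helper): snapshot the overwritten system properties before the application runs and restore them afterwards, so a follow-up application in the same JVM does not see leaked configuration.

```scala
// Hypothetical save-and-restore wrapper around a body that needs sys props set.
def withProps[T](props: Map[String, String])(body: => T): T = {
  // Remember the previous value (or absence) of every key we will overwrite.
  val saved = props.keys.map(k => k -> sys.props.get(k)).toMap
  props.foreach { case (k, v) => sys.props(k) = v }
  try body
  finally saved.foreach {
    case (k, Some(old)) => sys.props(k) = old // restore the previous value
    case (k, None)      => sys.props -= k     // key did not exist before
  }
}

val out = withProps(Map("spark.test.key" -> "a")) { sys.props("spark.test.key") }
println(out)                                  // prints "a"
println(sys.props.contains("spark.test.key")) // prints "false"
```

This handles sequential applications in one JVM; it would not help with concurrent ones, since system properties are process-global.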
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 @sjrand, can you please create another PR against branch-2.2? It is not auto-mergeable, thanks!
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 LGTM, merging to master and branch 2.2.
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19519 LGTM.
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 I think branch-2.2 also has a similar issue when fetching resources from remote secure HDFS.
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 ok to test.
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 Thanks for the fix! I didn't test on a secure cluster when I did the glob path support, so I didn't realize such an issue existed.
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19534 @sitalkedia I have a very old similar PR, #11205; maybe you can refer to it.
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19469 @felixcheung As you can see, there's a bunch of configurations that need to be added in https://github.com/apache-spark-on-k8s/spark/pull/516; that's why I'm asking for a general solution to this kind of issue. I'm OK to merge this PR, but I would suspect similar PRs will still be created in the future, since these issues are quite scenario specific: users have different scenarios and can hit different issues here. So I'm just wondering if we could have a better solution for this.
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19519 @vanzin, how do we leverage this new trait? Would you please explain more? Thanks!
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19534 @sitalkedia would you please fix the PR title? It seems to be broken now.
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 LGTM, merging to master.
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19469 @ssaavedra, yes I think so. With the pull-in of k8s support, I would guess more configurations will need to be added to the exclusion rule. With the current solution, a one-by-one PR approach doesn't make much sense; we should either figure out a general solution or refactor this part. Besides, as we have moved to Structured Streaming, do we need to spend more effort on these issues? @zsxwing
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 LGTM, just one minor comment.
[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19509#discussion_r145329972 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala --- @@ -347,6 +347,10 @@ package object config { .timeConf(TimeUnit.MILLISECONDS) .createWithDefault(Long.MaxValue) + private[spark] val KERBEROS_RELOGIN_PERIOD = ConfigBuilder("spark.yarn.kerberos.relogin.period") +.timeConf(TimeUnit.SECONDS) +.createWithDefaultString("1m") --- End diff -- I think we should put this into the docs. Also, is it called too frequently?
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 I see, thanks for the explanation. I didn't think about such a scenario.
[GitHub] spark issue #19263: [SPARK-22050][CORE] Allow BlockUpdated events to be opti...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19263 @vanzin, do you have other comments?
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19469 @ChenjunZou did you get a chance to look at my remaining comment?
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 >The effect of this change is that now it's possible to initialize multiple, non-concurrent SparkContext instances in the same JVM. @vanzin, do we support it now? As I remember, it was not supported before.
[GitHub] spark issue #19476: [SPARK-22062][CORE] Spill large block to disk in BlockMa...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19476 Jenkins, retest this please.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r145013312 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -653,15 +663,34 @@ private[spark] class BlockManager( require(blockId != null, "BlockId is null") var runningFailureCount = 0 var totalFailureCount = 0 -val locations = getLocations(blockId) + +// Because all the remote blocks are registered in driver, so it is not necessary to ask +// all the slave executors to get block status. +val locationAndStatus = master.getLocationsAndStatus(blockId) + +val blockSize = locationAndStatus._2.map { status => + // Disk size and mem size cannot co-exist, so it's ok to sum them together to get block size. + status.diskSize + status.memSize --- End diff -- I get your point; I was also thinking of using `Math.max` instead.
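The `Math.max` idea discussed above can be sketched as follows. `BlockStatus` here is a hypothetical, simplified stand-in for Spark's class, just to show the equivalence:

```scala
// Hypothetical, simplified stand-in for Spark's BlockStatus; the real class has more fields.
case class BlockStatus(memSize: Long, diskSize: Long)

// A cached block's memory and disk sizes are assumed not to both be non-zero here,
// so summing them and taking the max agree; max arguably states the intent more directly.
def blockSize(status: Option[BlockStatus]): Long =
  status.map(s => math.max(s.memSize, s.diskSize)).getOrElse(0L)

println(blockSize(Some(BlockStatus(memSize = 0L, diskSize = 1024L))))  // 1024
println(blockSize(None))                                               // 0
```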
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r145011923 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala --- @@ -509,11 +508,10 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE val bmId1 = BlockManagerId("id1", localHost, 1) val bmId2 = BlockManagerId("id2", localHost, 2) val bmId3 = BlockManagerId("id3", otherHost, 3) -when(bmMaster.getLocations(mc.any[BlockId])).thenReturn(Seq(bmId1, bmId2, bmId3)) --- End diff -- Agreed, will revert it back.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r145011775 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -684,7 +713,7 @@ private[spark] class BlockManager( // take a significant amount of time. To get rid of these stale entries // we refresh the block locations after a certain number of fetch failures if (runningFailureCount >= maxFailuresBeforeLocationRefresh) { -locationIterator = getLocations(blockId).iterator +locationIterator = sortLocations(master.getLocationsAndStatus(blockId)._1).iterator --- End diff -- Agreed, will change it.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r145010440 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -653,15 +663,34 @@ private[spark] class BlockManager( require(blockId != null, "BlockId is null") var runningFailureCount = 0 var totalFailureCount = 0 -val locations = getLocations(blockId) + +// Because all the remote blocks are registered in driver, so it is not necessary to ask +// all the slave executors to get block status. +val locationAndStatus = master.getLocationsAndStatus(blockId) + +val blockSize = locationAndStatus._2.map { status => + // Disk size and mem size cannot co-exist, so it's ok to sum them together to get block size. + status.diskSize + status.memSize --- End diff -- Are you saying we need to check `StorageLevel` to decide whether to use diskSize or memSize as the block size?
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r145009567 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -662,7 +662,9 @@ private[spark] object SparkConf extends Logging { "spark.yarn.jars" -> Seq( AlternateConfig("spark.yarn.jar", "2.0")), "spark.yarn.access.hadoopFileSystems" -> Seq( - AlternateConfig("spark.yarn.access.namenodes", "2.2")) + AlternateConfig("spark.yarn.access.namenodes", "2.2")), +"spark.maxRemoteBlockSizeFetchToMem" -> Seq( --- End diff -- Yes, I think so. `SparkConf` will print out a warning log if we add it here.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r145009167 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -653,15 +663,34 @@ private[spark] class BlockManager( require(blockId != null, "BlockId is null") var runningFailureCount = 0 var totalFailureCount = 0 -val locations = getLocations(blockId) + +// Because all the remote blocks are registered in driver, so it is not necessary to ask +// all the slave executors to get block status. +val locationAndStatus = master.getLocationsAndStatus(blockId) + +val blockSize = locationAndStatus._2.map { status => + // Disk size and mem size cannot co-exist, so it's ok to sum them together to get block size. + status.diskSize + status.memSize --- End diff -- @jiangxb1987 would you please explain more? I'm not quite following your comment. Are you referring to the line below, `}.getOrElse(0L)`?
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19419 LGTM.
[GitHub] spark issue #19396: [SPARK-22172][CORE] Worker hangs when the external shuff...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19396 Sorry for the late response. I understand your purpose now. I think such a behavior discrepancy is not a big problem. I guess the reason the NM still runs after the exception is that the NM doesn't serve only Spark but also MR/Tez, so a failure of the Spark external service should not affect MR's. Based on your comment above, I don't have a strong preference either way; I think both are OK. Maybe you can ping others to get their feedback.
[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19419#discussion_r144775941 --- Diff: docs/security.md --- @@ -186,7 +186,54 @@ configure those ports. +### HTTP Security Headers + +Apache Spark can be configured to include HTTP Headers which aids in preventing Cross +Site Scripting (XSS), Cross-Frame Scripting (XFS), MIME-Sniffing and also enforces HTTP +Strict Transport Security. + + +Property NameDefaultMeaning --- End diff -- I think in Spark we follow a 2-space indent for HTML code; you can refer to the other docs.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144775481 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1552,4 +1582,65 @@ private[spark] object BlockManager { override val metricRegistry = new MetricRegistry metricRegistry.registerAll(metricSet) } + + class RemoteBlockTempFileManager(blockManager: BlockManager) + extends TempFileManager with Logging { + +private class ReferenceWithCleanup(file: File, referenceQueue: JReferenceQueue[File]) +extends WeakReference[File](file, referenceQueue) { + private val filePath = file.getAbsolutePath + + def cleanUp(): Unit = { +logDebug(s"Clean up file $filePath") + +if (!new File(filePath).delete()) { + logDebug(s"Fail to delete file $filePath") +} + } +} --- End diff -- My concern is that for the shuffle part, since there's an explicit API to `cleanup` temp files, it's not so necessary to track them again with weak references. Also, weak references are triggered by GC, and shuffle operations are usually much more frequent and heavier, so using weak references to track temp shuffle files would probably increase GC overhead. By contrast, fetching remote blocks happens only occasionally, when a block is not cached locally, so using weak references should not add much overhead.
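The weak-reference pattern in the diff above can be sketched as follows. Names here are illustrative, not Spark's actual API, and for determinism the demo calls `cleanUp()` directly instead of waiting for GC to enqueue the reference:

```scala
import java.io.File
import java.lang.ref.{ReferenceQueue, WeakReference}

// Minimal sketch of the weak-reference cleanup pattern under discussion.
// In the PR, a background thread drains the queue after GC enqueues dead references.
class FileRefWithCleanup(file: File, queue: ReferenceQueue[File])
    extends WeakReference[File](file, queue) {
  // Keep only the path: holding on to the File itself would keep it strongly
  // reachable and defeat the purpose of the weak reference.
  private val path = file.getAbsolutePath
  def cleanUp(): Unit = {
    val f = new File(path)
    if (f.exists() && !f.delete()) {
      println(s"Failed to delete $path")
    }
  }
}

val queue = new ReferenceQueue[File]
val tmp = File.createTempFile("remote-block", ".tmp")
val ref = new FileRefWithCleanup(tmp, queue)

// GC timing is nondeterministic, so invoke cleanUp() directly for the demo.
ref.cleanUp()
println(tmp.exists())  // false
```

The key design point is that only the path is captured, so the referent `File` stays weakly reachable and can be collected, at which point the reference lands on the queue for cleanup.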
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144770999 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1552,4 +1582,65 @@ private[spark] object BlockManager { override val metricRegistry = new MetricRegistry metricRegistry.registerAll(metricSet) } + + class RemoteBlockTempFileManager(blockManager: BlockManager) + extends TempFileManager with Logging { + +private class ReferenceWithCleanup(file: File, referenceQueue: JReferenceQueue[File]) +extends WeakReference[File](file, referenceQueue) { + private val filePath = file.getAbsolutePath + + def cleanUp(): Unit = { +logDebug(s"Clean up file $filePath") + +if (!new File(filePath).delete()) { + logDebug(s"Fail to delete file $filePath") +} + } +} --- End diff -- Yes, that's what I mean. Whether it returns false (the caller deletes) or true (`ShuffleBlockFetcherIterator` deletes), the file will be deleted either way; that's my question: why is there still a file leak issue?
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144769226 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1552,4 +1582,65 @@ private[spark] object BlockManager { override val metricRegistry = new MetricRegistry metricRegistry.registerAll(metricSet) } + + class RemoteBlockTempFileManager(blockManager: BlockManager) + extends TempFileManager with Logging { + +private class ReferenceWithCleanup(file: File, referenceQueue: JReferenceQueue[File]) +extends WeakReference[File](file, referenceQueue) { + private val filePath = file.getAbsolutePath + + def cleanUp(): Unit = { +logDebug(s"Clean up file $filePath") + +if (!new File(filePath).delete()) { + logDebug(s"Fail to delete file $filePath") +} + } +} --- End diff -- But here in `ShuffleBlockFetcherIterator#registerTempFileToClean`, the caller will delete the file if it returns false; does it still have a file leak problem?
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144765076 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1552,4 +1582,65 @@ private[spark] object BlockManager { override val metricRegistry = new MetricRegistry metricRegistry.registerAll(metricSet) } + + class RemoteBlockTempFileManager(blockManager: BlockManager) + extends TempFileManager with Logging { + +private class ReferenceWithCleanup(file: File, referenceQueue: JReferenceQueue[File]) +extends WeakReference[File](file, referenceQueue) { + private val filePath = file.getAbsolutePath + + def cleanUp(): Unit = { +logDebug(s"Clean up file $filePath") + +if (!new File(filePath).delete()) { + logDebug(s"Fail to delete file $filePath") +} + } +} --- End diff -- I think the overhead is not big, but I'm not sure why there's a file leak issue here in `ShuffleBlockFetcherIterator` (the implementation of `TempFileManager`). From the code, all the temp files are tracked in `shuffleFilesSet` and will be deleted during `cleanup`; can you please elaborate?
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144764761 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -355,11 +355,21 @@ package object config { .doc("The blocks of a shuffle request will be fetched to disk when size of the request is " + "above this threshold. This is to avoid a giant request takes too much memory. We can " + "enable this config by setting a specific value(e.g. 200m). Note that this config can " + -"be enabled only when the shuffle shuffle service is newer than Spark-2.2 or the shuffle" + +"be enabled only when the shuffle service is newer than Spark-2.2 or the shuffle" + " service is disabled.") .bytesConf(ByteUnit.BYTE) .createWithDefault(Long.MaxValue) + private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM = +ConfigBuilder("spark.maxRemoteBlockSizeFetchToMem") + .doc("Remote block will be fetched to disk when size of the block is " + +"above this threshold. This is to avoid a giant request takes too much memory. We can " + +"enable this config by setting a specific value(e.g. 200m). Note this configuration will " + +"affect both shuffle fetch and block manager remote block fetch. For users who " + +"enabled external shuffle service, this feature can only be worked when external shuffle" + +" service is newer than Spark 2.2.") + .fallbackConf(REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM) --- End diff -- Thanks, let me check it.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144763884 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -355,11 +355,21 @@ package object config { .doc("The blocks of a shuffle request will be fetched to disk when size of the request is " + "above this threshold. This is to avoid a giant request takes too much memory. We can " + "enable this config by setting a specific value(e.g. 200m). Note that this config can " + -"be enabled only when the shuffle shuffle service is newer than Spark-2.2 or the shuffle" + +"be enabled only when the shuffle service is newer than Spark-2.2 or the shuffle" + " service is disabled.") .bytesConf(ByteUnit.BYTE) .createWithDefault(Long.MaxValue) + private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM = +ConfigBuilder("spark.maxRemoteBlockSizeFetchToMem") + .doc("Remote block will be fetched to disk when size of the block is " + +"above this threshold. This is to avoid a giant request takes too much memory. We can " + +"enable this config by setting a specific value(e.g. 200m). Note this configuration will " + +"affect both shuffle fetch and block manager remote block fetch. For users who " + +"enabled external shuffle service, this feature can only be worked when external shuffle" + +" service is newer than Spark 2.2.") + .fallbackConf(REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM) --- End diff -- From my understanding of the current code, it will not fall back to the deprecated config if we're using the API `SparkConf#get[T](entry: ConfigEntry[T])`, unless we specifically add a `fallbackConf` definition. This is different from `SparkConf#getOption(key: String)`.
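The fallback resolution described above can be sketched with a toy model. Spark's real `ConfigBuilder`/`ConfigEntry` machinery is more involved, and the `ConfEntry` type and `get` function here are simplified stand-ins; the idea is only the resolution order: look up the primary key first, then the fallback entry recursively, then the default:

```scala
// Toy model of typed config lookup with an optional fallback entry (not Spark's API).
case class ConfEntry(key: String, default: String, fallback: Option[ConfEntry] = None)

// Resolution order: primary key, then the fallback entry (recursively), then the default.
def get(settings: Map[String, String], entry: ConfEntry): String =
  settings.getOrElse(entry.key,
    entry.fallback.map(f => get(settings, f)).getOrElse(entry.default))

val reducerMaxReq =
  ConfEntry("spark.reducer.maxReqSizeShuffleToMem", "9223372036854775807b")
val maxRemote =
  ConfEntry("spark.maxRemoteBlockSizeFetchToMem", "9223372036854775807b",
    fallback = Some(reducerMaxReq))

// Only the old key is set: the new entry falls back to it.
println(get(Map(reducerMaxReq.key -> "200m"), maxRemote))  // 200m
// When both keys are set, the primary key wins.
println(get(Map(maxRemote.key -> "512m", reducerMaxReq.key -> "200m"), maxRemote))  // 512m
```

Without the `fallback` link, the first lookup would return the default rather than the old key's value, which is the behavior the comment is pointing out for `SparkConf#get(entry)`.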
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19419 >/home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/internal/config/package.scala:440:0: Whitespace at end of line Please fix the style issue.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19419 ok to test.
[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19419#discussion_r144488504 --- Diff: docs/configuration.md --- @@ -2013,7 +2013,62 @@ Apart from these, the following properties are also available, and may be useful +### HTTP Security Headers --- End diff -- I think this could move to `security.md` as security-related advanced configuration.
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144472362 --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala --- @@ -510,4 +510,86 @@ class FileSuite extends SparkFunSuite with LocalSparkContext { } } + test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") { +val conf = new SparkConf() +conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true) +sc = new SparkContext(conf) + +def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int, + outputSuffix: Int, checkPart: String, expectedPartitionNum: Int): Unit = { + val dataRDD = sc.parallelize(data, numSlices) + val output = new File(tempDir, "output" + outputSuffix) + dataRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath) + assert(new File(output, checkPart).exists() === true) + val hadoopRDD = sc.textFile(new File(output, "part-*").getPath) + assert(hadoopRDD.partitions.length === expectedPartitionNum) +} + +// Ensure that if all of the splits are empty, we remove the splits correctly +testIgnoreEmptySplits( + data = Array.empty[Tuple2[String, String]], + numSlices = 1, + outputSuffix = 0, + checkPart = "part-0", + expectedPartitionNum = 0) + +// Ensure that if no split is empty, we don't lose any splits +testIgnoreEmptySplits( + data = Array(("key1", "a"), ("key2", "a"), ("key3", "b")), + numSlices = 2, + outputSuffix = 1, + checkPart = "part-1", + expectedPartitionNum = 2) + +// Ensure that if part of the splits are empty, we remove the splits correctly +testIgnoreEmptySplits( + data = Array(("key1", "a"), ("key2", "a")), + numSlices = 5, + outputSuffix = 2, + checkPart = "part-4", + expectedPartitionNum = 2) + } + + test("spark.files.ignoreEmptySplits work correctly (new Hadoop API)") { +val conf = new SparkConf() +conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true) +sc = new SparkContext(conf) + +def testIgnoreEmptySplits(data: 
Array[Tuple2[String, String]], numSlices: Int, --- End diff -- ditto.
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144472244 --- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala --- @@ -510,4 +510,86 @@ class FileSuite extends SparkFunSuite with LocalSparkContext { } } + test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") { +val conf = new SparkConf() +conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true) +sc = new SparkContext(conf) + +def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int, --- End diff -- nit: one argument per line.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144456817 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -426,4 +426,11 @@ package object config { .toSequence .createOptional + private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM = --- End diff -- I would prefer to use `spark.storage.maxRemoteBlockSizeFetchToMemory`, since the driver-side block manager will also leverage this feature.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144453507 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -426,4 +426,11 @@ package object config { .toSequence .createOptional + private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM = --- End diff -- I was thinking about this, but the configuration name `REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM` seems too shuffle specific; maybe we should rename it.
[GitHub] spark issue #19476: [SPARK-22062][CORE] Spill large block to disk in BlockMa...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19476 @cloud-fan @jiangxb1987 @jinxing64 would you please help to review when you have time, thanks!
[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r144450997 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea /** List all the blocks currently stored on disk by the disk manager. */ def getAllBlocks(): Seq[BlockId] = { -getAllFiles().map(f => BlockId(f.getName)) +getAllFiles().flatMap { f => + val blockId = BlockId.guess(f.getName) --- End diff -- It doesn't seem necessary to define a new method `guess` only for use here. I think we can still use `apply` and catch/log the exception. In other words, we can simply change `apply()` and use it here; defining a new `guess` method is not really necessary.
[GitHub] spark pull request #19263: [SPARK-22050][CORE] Allow BlockUpdated events to ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19263#discussion_r19147 --- Diff: docs/configuration.md --- @@ -714,6 +714,13 @@ Apart from these, the following properties are also available, and may be useful Property NameDefaultMeaning + spark.eventLog.blockUpdates --- End diff -- I think it would be better to change the configuration name to end with ".enabled", to reflect that this is a boolean configuration.
[GitHub] spark pull request #19263: [SPARK-22050][CORE] Allow BlockUpdated events to ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19263#discussion_r18138 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -41,6 +41,22 @@ package object config { .bytesConf(ByteUnit.MiB) .createWithDefaultString("1g") + private[spark] val EVENT_LOG_COMPRESS = + ConfigBuilder("spark.eventLog.compress").booleanConf.createWithDefault(false) + + private[spark] val EVENT_LOG_BLOCK_UPDATES = + ConfigBuilder("spark.eventLog.blockUpdates").booleanConf.createWithDefault(false) + + private[spark] val EVENT_LOG_TESTING = + ConfigBuilder("spark.eventLog.testing").booleanConf.createWithDefault(false) --- End diff -- I think this configuration should be marked as `internal()`.
[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19419#discussion_r144286844 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -79,6 +79,9 @@ private[spark] object JettyUtils extends Logging { val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") +val xXssProtectionValue = conf.getOption("spark.ui.xXssProtection") --- End diff -- It follows the new code convention; newly added configurations should be changed to that style.
[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19419#discussion_r144283398 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -89,6 +92,13 @@ private[spark] object JettyUtils extends Logging { val result = servletParams.responder(request) response.setHeader("Cache-Control", "no-cache, no-store, must-revalidate") response.setHeader("X-Frame-Options", xFrameOptionsValue) + xXssProtectionValue.foreach(response.setHeader("X-XSS-Protection", _)) +if (xContentTypeOptionsValue.get.equalsIgnoreCase("true")) { + response.setHeader("X-Content-Type-Options", "nosniff") +} +if (conf.get("spark.ssl.enabled").equalsIgnoreCase("true")) { --- End diff -- I think you can check whether `request.scheme` is "https" instead?
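The suggestion above, keying off the request scheme rather than a `spark.ssl.enabled` lookup, could look roughly like this (hypothetical helper; the actual change would live inside `JettyUtils.createServlet`):

```scala
// Build the security headers for a response. Strict-Transport-Security is
// only meaningful over HTTPS, so attach it based on the request's scheme.
def securityHeaders(scheme: String, sts: Option[String]): Map[String, String] = {
  val base = Map("X-Frame-Options" -> "SAMEORIGIN")
  if (scheme.equalsIgnoreCase("https"))
    sts.fold(base)(v => base + ("Strict-Transport-Security" -> v))
  else base
}
```

Checking the scheme per request also handles deployments where the UI is reachable over both HTTP and HTTPS.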
[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19458 Yes, I agree that in any case it should not throw an exception. But in this PR you filtered out temp shuffle/local blocks; do you consider those valid blocks or not? I'd prefer not to filter them out, and instead add two parsing rules for those blocks, and for any other illegal files (that cannot be parsed) catch and log the exception.
[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19458 Instead of filtering out temp blocks, why not add parsing rules for `TempLocalBlockId` and `TempShuffleBlockId`? That would also solve the problem. Since `DiskBlockManager#getAllFiles` doesn't filter out temp shuffle/local files, wouldn't it be better to keep the same behavior for `DiskBlockManager#getAllBlocks`?
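The parsing-rule approach would extend the set of file-name patterns the block-ID parser recognizes. A standalone sketch of the two extra rules (simplified; Spark's real `BlockId.apply` handles many more block types, and temp block names are "temp_local_" / "temp_shuffle_" followed by a UUID):

```scala
// Name patterns for the two temp block kinds discussed above.
val TempLocal = "temp_local_(.+)".r
val TempShuffle = "temp_shuffle_(.+)".r

def parseTempBlock(name: String): Option[String] = name match {
  case TempLocal(uuid)   => Some(s"TempLocalBlockId($uuid)")
  case TempShuffle(uuid) => Some(s"TempShuffleBlockId($uuid)")
  case _                 => None  // caller can log and skip unparseable files
}
```

With rules like these in place, `getAllBlocks` can parse every file `getAllFiles` returns instead of silently dropping some of them.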
[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19419#discussion_r144186220 --- Diff: conf/spark-defaults.conf.template --- @@ -25,3 +25,10 @@ # spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" + +# spark.ui.allowFramingFrom https://www.example.com/ +# spark.ui.xXssProtection 1; mode=block +# spark.ui.xContentType.options nosniff + +# Enable below only when Spark is running on HTTPS +# spark.ui.strictTransportSecurity max-age=31536000 --- End diff -- What's the meaning of this specific number "31536000"?
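For context on the question: 31536000 is one non-leap year expressed in seconds, a conventional `max-age` value for the Strict-Transport-Security header.

```scala
// 365 days * 24 hours * 60 minutes * 60 seconds
val oneYearInSeconds: Long = 365L * 24 * 60 * 60
```

Browsers cache the HSTS policy for this duration, so a year-long max-age means a client will insist on HTTPS for that host for a year after its last visit.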
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19419 @vanzin @tgravescs @ajbozarth what is your opinion on this PR? Is it a necessary fix for Spark?
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19468#discussion_r144182701 --- Diff: pom.xml --- @@ -2649,6 +2649,13 @@ + kubernetes --- End diff -- We should also change the sbt file to make it work using sbt.
[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19464 IIUC this issue also exists in `NewHadoopRDD` and possibly `FileScanRDD`; we'd better fix those as well.
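The fix under discussion amounts to dropping zero-length input splits before partitions are created, gated behind an opt-in flag. A minimal sketch (the split type and flag handling are simplified stand-ins, not Spark's actual API):

```scala
// Toy stand-in for a Hadoop InputSplit: just a path and a byte length.
case class Split(path: String, length: Long)

// Drop zero-length splits only when the user opted in, preserving the old
// behavior (one partition per split, empty or not) by default.
def maybeFilterEmpty(splits: Seq[Split], filterEnabled: Boolean): Seq[Split] =
  if (filterEnabled) splits.filter(_.length > 0) else splits
```

Applying the same filter in `NewHadoopRDD` and `FileScanRDD` would keep the three read paths consistent.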
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144181321 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -196,7 +196,10 @@ class HadoopRDD[K, V]( // add the credentials here as this can be called before SparkContext initialized SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) -val inputSplits = inputFormat.getSplits(jobConf, minPartitions) +var inputSplits = inputFormat.getSplits(jobConf, minPartitions) +if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) { --- End diff -- I would suggest not using a name that starts with "spark.hadoop"; configurations with this prefix are treated as Hadoop configurations and set into the Hadoop `Configuration`, so it would be better to choose another name.
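The reason the prefix matters: Spark copies every `spark.hadoop.*` key, with the prefix stripped, into the Hadoop `Configuration`. A standalone sketch of that propagation rule over a plain map (the real logic lives in `SparkHadoopUtil`):

```scala
// Keys under the "spark.hadoop." namespace are handed to Hadoop verbatim,
// so a Spark-only switch placed there would leak into the Hadoop config.
def propagatedToHadoop(sparkConf: Map[String, String]): Map[String, String] =
  sparkConf.collect {
    case (k, v) if k.startsWith("spark.hadoop.") =>
      k.stripPrefix("spark.hadoop.") -> v
  }
```

A flag like the one in the diff would therefore end up as `filterOutEmptySplit` inside the Hadoop `Configuration`, which is why a different namespace is preferable.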
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/19476 [SPARK-22062][CORE] Spill large block to disk in BlockManager's remote fetch to avoid OOM ## What changes were proposed in this pull request? Currently, BlockManager's `getRemoteBytes` calls `BlockTransferService#fetchBlockSync` to get a remote block. In `fetchBlockSync`, Spark allocates a temporary `ByteBuffer` to store the whole fetched block, which can lead to OOM if the block is too big or several blocks are fetched simultaneously in the executor. So here we leverage the idea of shuffle fetch and spill large blocks to local disk before they are consumed by upstream code. The behavior is controlled by a newly added configuration: if the block size is smaller than the threshold, the block is kept in memory; otherwise it is first spilled to disk and then read back from the disk file. To achieve this, I did the following: 1. Renamed `TempShuffleFileManager` to `TempFileManager`, since it is no longer used only by shuffle. 2. Added a new `TempFileManager` to manage the files of fetched remote blocks; the files are tracked by weak references and deleted when no longer in use. ## How was this patch tested? Tested by adding a UT, plus manual verification in a local test that performs GC to clean up the files. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/jerryshao/apache-spark SPARK-22062 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19476.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19476 commit f50a7b75c303bd2cf261dfb1b4fe74fa5498ca4b Author: jerryshao <ss...@hortonworks.com> Date: 2017-10-12T01:47:35Z Spill large blocks to disk during remote fetches in BlockManager
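The core decision described in the PR body, keeping small fetched blocks in memory and streaming large ones to disk, reduces to a threshold check. A sketch (names and types are illustrative, not the actual Spark API):

```scala
// Where a fetched remote block should land, per the spill-to-disk design.
sealed trait FetchDestination
case object InMemory extends FetchDestination
case object OnDisk extends FetchDestination

// Blocks at or below the threshold stay in memory; larger ones are spilled
// to a temp file first and then served back from disk.
def destinationFor(blockSize: Long, maxInMemorySize: Long): FetchDestination =
  if (blockSize <= maxInMemorySize) InMemory else OnDisk
```

The weak-reference tracking in the PR then ensures the temp files backing `OnDisk` blocks are deleted once nothing references them.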
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19469 There's a similar PR #19427; I was wondering if we can provide a general solution for such issues, like using a configuration to specify all the confs that need to be reloaded, e.g. spark.streaming.confsToReload = spark.yarn.jars,spark.xx.xx, so that we don't need to fix related issues again and again. What do you think?
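The general mechanism proposed here, one configuration naming all the keys to re-read after checkpoint recovery, could be parsed like this (a sketch; `spark.streaming.confsToReload` is the name proposed in the comment, not an existing Spark configuration):

```scala
// Read the comma-separated list of configuration keys that should be
// reloaded from the fresh properties rather than the checkpoint.
def confsToReload(conf: Map[String, String]): Seq[String] =
  conf.getOrElse("spark.streaming.confsToReload", "")
    .split(",").map(_.trim).filter(_.nonEmpty).toSeq
```

On recovery, the driver would then overwrite just these keys in the checkpointed SparkConf with their current values.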
[GitHub] spark issue #19466: [SPARK-22237] [CORE] Fix spark submit file download for ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19466 Would you please show us an example of how it breaks? The code here, which assigns all resources to local ones, might work, but it obscures which line is really broken; can you please describe it in more detail? Thanks!
[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19399 I agree with @squito that the criteria for defining an application's success need careful thought. In your current code, the application is marked successful only if all of its jobs succeed; isn't that too strict, since it allows no failure and retry at all? Besides, if an application runs all its Spark jobs successfully but fails in its own code (e.g., saving to a DB) and exits with a non-zero code, should we mark it as succeeded or failed? Also, the structure used to track all the jobs, `jobToStatus`, will grow memory usage indefinitely in a long-running application. Finally, with your changes the page loading time will increase; for applications with many jobs (like Spark Streaming) the problem will be severe.
[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19287 LGTM, merging to master. Thanks!
[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19077#discussion_r143380706 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -116,9 +116,10 @@ private [sql] object GenArrayData { s"final ArrayData $arrayDataName = new $genericArrayClass($arrayName);", arrayDataName) } else { + val numBytes = elementType.defaultSize * numElements val unsafeArraySizeInBytes = UnsafeArrayData.calculateHeaderPortionInBytes(numElements) + - ByteArrayMethods.roundNumberOfBytesToNearestWord(elementType.defaultSize * numElements) +ByteArrayMethods.roundNumberOfBytesToNearestWord(numBytes).toInt --- End diff -- Minor: why don't we inline this instead of creating a new variable?
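For context on the expression being discussed: `ByteArrayMethods.roundNumberOfBytesToNearestWord` rounds a byte count up to the next multiple of 8. A standalone equivalent of that rounding:

```scala
// Round up to the nearest 8-byte word boundary using bit masking:
// add 7, then clear the low three bits.
def roundToWord(numBytes: Long): Long = (numBytes + 7) & ~7L
```

The rounded size plus the header portion gives the total bytes needed for the unsafe array layout.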
[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19287 Jenkins, retest this please.
[GitHub] spark pull request #19419: [SPARK-22188] [CORE] Adding security headers for ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19419#discussion_r143377794 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -79,6 +79,9 @@ private[spark] object JettyUtils extends Logging { val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") +val xXssProtectionValue = conf.getOption("spark.ui.xXssProtection") --- End diff -- Please use `ConfigEntry` for newly added configurations; you can refer to `org.apache.spark.internal.config`.
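A minimal model of the `ConfigEntry` style being requested, for an optional string key like the one in the diff (simplified; the real builder is `org.apache.spark.internal.config.ConfigBuilder`, where this would be `stringConf.createOptional`):

```scala
// Stand-in for a typed optional config entry: the key is declared once and
// reads go through the entry instead of ad-hoc conf.getOption calls.
case class OptionalStringConf(key: String) {
  def readFrom(conf: Map[String, String]): Option[String] = conf.get(key)
}

val UI_X_XSS_PROTECTION = OptionalStringConf("spark.ui.xXssProtection")
```

Centralizing keys in typed entries avoids typos in string literals scattered through the codebase and documents defaults in one place.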