[GitHub] spark pull request #22485: [SPARK-18364][YARN] Expose metrics for YarnShuffl...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/22485#discussion_r219916796 --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java --- @@ -168,6 +170,15 @@ protected void serviceInit(Configuration conf) throws Exception { TransportConf transportConf = new TransportConf("shuffle", new HadoopConfigProvider(conf)); blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile); + // register metrics on the block handler into the Node Manager's metrics system. + YarnShuffleServiceMetrics serviceMetrics = + new YarnShuffleServiceMetrics(blockHandler.getAllMetrics()); + + MetricsSystemImpl metricsSystem = (MetricsSystemImpl) DefaultMetricsSystem.instance(); + metricsSystem.register( --- End diff -- Sorry for the delay. My explanation for using reflection was here: https://github.com/palantir/spark/pull/149/files#r107770857 -- basically `registerSource` isn't accessible from here. As to why use `registerSource` vs `register`, I don't think I knew about `register` at the time. Looking at https://github.com/apache/hadoop/blame/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/impl/MetricsSystemImpl.java `registerSource` ends up being called both ways. If you can get the right behavior using the accessible method rather than doing reflection, I'd say we should go with that way. cc @robert3005 @mccheah fysa for potential future merge conflict --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22485: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/22485 Sorry for not following through on getting this into Apache. FWIW, it's been in the Palantir fork of Spark for over a year: https://github.com/palantir/spark/search?q=SPARK-18364_q=SPARK-18364 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21334: [minor][SQL]Improve ParseError stop location when offend...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/21334 Hi @rubenfiszel thanks for the contribution! Can you please take a glance through http://spark.apache.org/contributing.html to see the best way to get your change merged into Apache Spark? I'd suggest you: - file an issue at https://issues.apache.org/jira/projects/SPARK and put that in the PR title - include a test that verifies the fix is working as you expect Cheers! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/20372 Tagging folks who have touched this code recently: @vgankidi @ericl @davies This seems to provide a more compact packing in every scenario, which should improve execution times. One risk is that individual partitions are no longer always contiguous ranges of files in order, but rather sometimes they have a gap. In the test this is the `(file1, file6)` partition. If something depends on this past behavior it could now break, though I don't think anything should be requiring this partition ordering. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/20372 Please fix the scala style checks -- ``` Running Scala style checks Scalastyle checks failed at following occurrences: [error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:459: File line length exceeds 100 characters [error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:463: File line length exceeds 100 characters [error] Total time: 14 s, completed Jan 23, 2018 10:44:36 PM [error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received return code 1 ``` and verify locally with `./dev/lint-scala` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20372: Improved block merging logic for partitions
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/20372#discussion_r163464207 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -445,16 +445,25 @@ case class FileSourceScanExec( currentSize = 0 } -// Assign files to partitions using "Next Fit Decreasing" -splitFiles.foreach { file => - if (currentSize + file.length > maxSplitBytes) { +def addFile(file: PartitionedFile): Unit = { +currentFiles += file +currentSize += file.length + openCostInBytes +} + +var frontIndex = 0 +var backIndex = splitFiles.length - 1 + +while (frontIndex <= backIndex) { +while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) { +addFile(splitFiles(frontIndex)) +frontIndex += 1 +} +while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) { +addFile(splitFiles(backIndex)) +backIndex -= 1 +} closePartition() - } - // Add the given file to the current partition. - currentSize += file.length + openCostInBytes --- End diff -- saw you added a commit to handle the large non-splittable files case -- can you please add a test for that also? want to make sure this while loop never becomes an infinite loop! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/20372 Jenkins, this is ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20372: Improved block merging logic for partitions
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/20372#discussion_r163419745 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala --- @@ -142,15 +142,16 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi SQLConf.FILES_OPEN_COST_IN_BYTES.key -> "1") { checkScan(table.select('c1)) { partitions => // Files should be laid out [(file1), (file2, file3), (file4, file5), (file6)] -assert(partitions.size == 4, "when checking partitions") -assert(partitions(0).files.size == 1, "when checking partition 1") +assert(partitions.size == 3, "when checking partitions") +assert(partitions(0).files.size == 2, "when checking partition 1") assert(partitions(1).files.size == 2, "when checking partition 2") assert(partitions(2).files.size == 2, "when checking partition 3") -assert(partitions(3).files.size == 1, "when checking partition 4") // First partition reads (file1) --- End diff -- comment is stale now --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20372: Improved block merging logic for partitions
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/20372#discussion_r163424784 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -445,16 +445,25 @@ case class FileSourceScanExec( currentSize = 0 } -// Assign files to partitions using "Next Fit Decreasing" -splitFiles.foreach { file => - if (currentSize + file.length > maxSplitBytes) { +def addFile(file: PartitionedFile): Unit = { +currentFiles += file +currentSize += file.length + openCostInBytes +} + +var frontIndex = 0 +var backIndex = splitFiles.length - 1 + +while (frontIndex <= backIndex) { +while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) { +addFile(splitFiles(frontIndex)) +frontIndex += 1 +} +while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) { +addFile(splitFiles(backIndex)) +backIndex -= 1 +} closePartition() - } - // Add the given file to the current partition. - currentSize += file.length + openCostInBytes --- End diff -- should add a test for this too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20372: Improved block merging logic for partitions
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/20372#discussion_r163419675 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala --- @@ -142,15 +142,16 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi SQLConf.FILES_OPEN_COST_IN_BYTES.key -> "1") { checkScan(table.select('c1)) { partitions => // Files should be laid out [(file1), (file2, file3), (file4, file5), (file6)] --- End diff -- comment is stale now --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20372: Improved block merging logic for partitions
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/20372#discussion_r163424415 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -445,16 +445,25 @@ case class FileSourceScanExec( currentSize = 0 } -// Assign files to partitions using "Next Fit Decreasing" -splitFiles.foreach { file => - if (currentSize + file.length > maxSplitBytes) { +def addFile(file: PartitionedFile): Unit = { +currentFiles += file +currentSize += file.length + openCostInBytes +} + +var frontIndex = 0 +var backIndex = splitFiles.length - 1 + +while (frontIndex <= backIndex) { +while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) { +addFile(splitFiles(frontIndex)) +frontIndex += 1 +} +while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) { +addFile(splitFiles(backIndex)) +backIndex -= 1 +} closePartition() - } - // Add the given file to the current partition. - currentSize += file.length + openCostInBytes --- End diff -- this handles files that are larger than the maxSplitBytes, which I think isn't done in the new algorithm. Need to make sure those don't form an infinite loop --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19917: Add failing test for select with a splatted strea...
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/19917 Add failing test for select with a splatted stream ## What changes were proposed in this pull request? Add additional test. ## How was this patch tested? Additional test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark aash/PDS-59284 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19917.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19917 commit 2535b214ea5349209aecaa4b93868df548077c31 Author: Andrew Ash <and...@andrewash.com> Date: 2017-12-07T00:50:27Z Add test for PDS-59284 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19257 @cloud-fan @gatorsmile any more changes needed on this PR before merging? I don't see any un-addressed comments left. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19829: [WIP]Upgrade Netty to 4.1.17
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19829 Looks like a fix for https://issues.apache.org/jira/browse/SPARK-19552 -- should that be reopened now that netty is deprecating 4.0.x so we can't do it "Later" anymore? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19702#discussion_r151569727 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala --- @@ -372,23 +381,18 @@ private[parquet] class ParquetSchemaConverter( // `TIMESTAMP_MICROS` which are both logical types annotating `INT64`. // // Originally, Spark SQL uses the same nanosecond timestamp type as Impala and Hive. Starting - // from Spark 1.5.0, we resort to a timestamp type with 100 ns precision so that we can store - // a timestamp into a `Long`. This design decision is subject to change though, for example, - // we may resort to microsecond precision in the future. - // - // For Parquet, we plan to write all `TimestampType` value as `TIMESTAMP_MICROS`, but it's - // currently not implemented yet because parquet-mr 1.8.1 (the version we're currently using) - // hasn't implemented `TIMESTAMP_MICROS` yet, however it supports TIMESTAMP_MILLIS. We will - // encode timestamp values as TIMESTAMP_MILLIS annotating INT64 if - // 'spark.sql.parquet.int64AsTimestampMillis' is set. - // - // TODO Converts `TIMESTAMP_MICROS` once parquet-mr implements that. - - case TimestampType if writeTimestampInMillis => -Types.primitive(INT64, repetition).as(TIMESTAMP_MILLIS).named(field.name) - + // from Spark 1.5.0, we resort to a timestamp type with microsecond precision so that we can --- End diff -- I think this comment change loses some historical context about behavior around 1.5.0. From 1.5.0 to 2.2.0 timestamps were with 100ns precision, and now with this change, starting in 2.3.0 they are microsecond precision, right? The change as written seems to retroactively change the described behavior, unless comment was wrong --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19708: [SPARK-22479][SQL] Exclude credentials from Savei...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19708#discussion_r151010772 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommandSuite.scala --- @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.SparkConf +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.test.SharedSQLContext + +class SaveIntoDataSourceCommandSuite extends SharedSQLContext { + + override protected def sparkConf: SparkConf = super.sparkConf + .set("spark.redaction.string.regex", "(?i)password|url") + + test("treeString is redacted") { --- End diff -- old test name? we're not modifying the treeString anymore, it's just the `SaveIntoDataSourceCommand` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19708: [SPARK-22479][SQL] Exclude credentials from SaveintoData...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19708 Jenkins, this is ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19694: [SPARK-22470][DOC][SQL] functions.hash is also us...
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/19694 [SPARK-22470][DOC][SQL] functions.hash is also used internally for shuffle and bucketing ## What changes were proposed in this pull request? Add clarifying documentation to the scaladoc of `functions.hash` ## How was this patch tested? Existing tests, though this is a docs-only change. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark patch-4 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19694.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19694 commit 94278a7f76e5383228c7d974825abb80f4f0f8b9 Author: Andrew Ash <and...@andrewash.com> Date: 2017-11-08T10:56:31Z [SPARK-22470][DOC][SQL] functions.hash is also used internally for shuffle and bucketing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19468#discussion_r147021138 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala --- @@ -0,0 +1,229 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.scheduler.cluster.k8s + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model._ + +import org.apache.spark.{SparkConf, SparkException} +import org.apache.spark.deploy.k8s.ConfigurationUtils +import org.apache.spark.deploy.k8s.config._ +import org.apache.spark.deploy.k8s.constants._ +import org.apache.spark.util.Utils + +/** + * Configures executor pods. Construct one of these with a SparkConf to set up properties that are + * common across all executors. Then, pass in dynamic parameters into createExecutorPod. + */ +private[spark] trait ExecutorPodFactory { + def createExecutorPod( + executorId: String, + applicationId: String, + driverUrl: String, + executorEnvs: Seq[(String, String)], + driverPod: Pod, + nodeToLocalTaskCount: Map[String, Int]): Pod +} + +private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf) + extends ExecutorPodFactory { + + import ExecutorPodFactoryImpl._ + + private val executorExtraClasspath = sparkConf.get( +org.apache.spark.internal.config.EXECUTOR_CLASS_PATH) + private val executorJarsDownloadDir = sparkConf.get(INIT_CONTAINER_JARS_DOWNLOAD_LOCATION) + + private val executorLabels = ConfigurationUtils.parsePrefixedKeyValuePairs( +sparkConf, +KUBERNETES_EXECUTOR_LABEL_PREFIX, +"executor label") + require( +!executorLabels.contains(SPARK_APP_ID_LABEL), +s"Custom executor labels cannot contain $SPARK_APP_ID_LABEL as it is reserved for Spark.") + require( +!executorLabels.contains(SPARK_EXECUTOR_ID_LABEL), +s"Custom executor labels cannot contain $SPARK_EXECUTOR_ID_LABEL as it is reserved for" + + s" Spark.") + + private val executorAnnotations = +ConfigurationUtils.parsePrefixedKeyValuePairs( + sparkConf, + KUBERNETES_EXECUTOR_ANNOTATION_PREFIX, + "executor annotation") + private val nodeSelector = +ConfigurationUtils.parsePrefixedKeyValuePairs( + sparkConf, + KUBERNETES_NODE_SELECTOR_PREFIX, + "node selector") + + private val executorDockerImage = sparkConf.get(EXECUTOR_DOCKER_IMAGE) + private val dockerImagePullPolicy = sparkConf.get(DOCKER_IMAGE_PULL_POLICY) + private val executorPort = sparkConf.getInt("spark.executor.port", DEFAULT_STATIC_PORT) + private val blockmanagerPort = sparkConf +.getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT) + private val kubernetesDriverPodName = sparkConf +.get(KUBERNETES_DRIVER_POD_NAME) +.getOrElse(throw new SparkException("Must specify the driver pod name")) + + private val executorPodNamePrefix = sparkConf.get(KUBERNETES_EXECUTOR_POD_NAME_PREFIX) + + private val executorMemoryMiB = sparkConf.get(org.apache.spark.internal.config.EXECUTOR_MEMORY) + private val executorMemoryString = sparkConf.get( +org.apache.spark.internal.config.EXECUTOR_MEMORY.key, +org.apache.spark.internal.config.EXECUTOR_MEMORY.defaultValueString) + + private val memoryOverheadMiB = sparkConf +.get(KUBERNETES_EXECUTOR_MEMORY_OVERHEAD) +.getOrElse(math.max((MEMORY_OVERHEAD_FACTOR * executorMemoryMiB).toInt, + MEMORY_OVERHEAD_MIN_MIB)) + private val executorMemoryWithOverhead = executorMemoryMiB + memoryOverheadMiB + + private val executorCores = sparkConf.getDouble("spark.executor.cores", 1d) + private val executorLimitCores = sparkConf.g
[GitHub] spark pull request #19574: [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/19574 [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint ## What changes were proposed in this pull request? Fix java lint ## How was this patch tested? Run `./dev/lint-java` You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark aash/fix-java-lint Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19574.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19574 commit f393c051ad2fc6b4be63447e3109fafbbea0be40 Author: Andrew Ash <and...@andrewash.com> Date: 2017-10-25T18:01:38Z [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] Fix race condition in LauncherSe...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19217 https://github.com/apache/spark/pull/19574 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] Fix race condition in LauncherSe...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19217 ``` Running Java style checks Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[238] (regexp) RegexpSingleline: No trailing whitespace allowed. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[247] (regexp) RegexpSingleline: No trailing whitespace allowed. [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConnections...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19217 @nivox can you please update the PR title when you get the chance? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConnections...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19217 How about `[SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConn...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19217#discussion_r146493105 --- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java --- @@ -232,20 +232,20 @@ public void run() { }; ServerConnection clientConnection = new ServerConnection(client, timeout); Thread clientThread = factory.newThread(clientConnection); -synchronized (timeout) { --- End diff -- we're no longer synchronizing on the `timeout` here, but I didn't see anywhere else synchronizing on it either (including `ServerConnection`). Given it doesn't escape this method, I'm not sure how multiple threads could ever access `timeout` at once, so it makes sense to remove this synchronization --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConn...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19217#discussion_r146492057 --- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java --- @@ -232,20 +232,20 @@ public void run() { }; ServerConnection clientConnection = new ServerConnection(client, timeout); Thread clientThread = factory.newThread(clientConnection); -synchronized (timeout) { - clientThread.start(); - synchronized (clients) { -clients.add(clientConnection); - } - long timeoutMs = getConnectionTimeout(); - // 0 is used for testing to avoid issues with clock resolution / thread scheduling, - // and force an immediate timeout. - if (timeoutMs > 0) { -timeoutTimer.schedule(timeout, getConnectionTimeout()); - } else { -timeout.run(); - } +synchronized (clients) { + clients.add(clientConnection); --- End diff -- we are now adding to `clients` before starting the `clientThread` instead of after. What's the expected ordering for these two operations? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConn...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19217#discussion_r146493161 --- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java --- @@ -232,20 +232,20 @@ public void run() { }; ServerConnection clientConnection = new ServerConnection(client, timeout); Thread clientThread = factory.newThread(clientConnection); -synchronized (timeout) { - clientThread.start(); - synchronized (clients) { -clients.add(clientConnection); - } - long timeoutMs = getConnectionTimeout(); - // 0 is used for testing to avoid issues with clock resolution / thread scheduling, - // and force an immediate timeout. - if (timeoutMs > 0) { -timeoutTimer.schedule(timeout, getConnectionTimeout()); - } else { -timeout.run(); - } +synchronized (clients) { + clients.add(clientConnection); +} + +long timeoutMs = getConnectionTimeout(); +// 0 is used for testing to avoid issues with clock resolution / thread scheduling, +// and force an immediate timeout. +if (timeoutMs > 0) { + timeoutTimer.schedule(timeout, timeoutMs); --- End diff -- +1 on not calling `getConnectionTimeout()` multiple times --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19486 Updated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145296142 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Command.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.UnsafeRow +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.execution.command.RunnableCommand +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, query: LogicalPlan) + extends RunnableCommand { --- End diff -- I've also observed this issue where the explain output of commands behaves differently than from logical plans, and have a repro at https://issues.apache.org/jira/browse/SPARK-22204 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19486: [SPARK-22268][BUILD] Fix lint-java
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/19486 [SPARK-22268][BUILD] Fix lint-java ## What changes were proposed in this pull request? Fix java style issues ## How was this patch tested? Run `./dev/lint-java` locally since it's not run on Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark aash/fix-lint-java Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19486.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19486 commit 7c321fc7bc9e448ea3b7177089a8e814d901dd26 Author: Andrew Ash <and...@andrewash.com> Date: 2017-10-12T21:57:19Z Fix lint-java ./dev/lint-java --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19468 Jenkins, ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19131: [MINOR][SQL]remove unuse import class
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19131 A check for unused imports should be added to scalastyle to prevent these from creeping back in. If this PR was accompanied with that check (failing before, now passing) I think the merge conflicts are worth it to permanently fix the problem. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19164: [SPARK-21953] Show both memory and disk bytes spi...
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/19164 [SPARK-21953] Show both memory and disk bytes spilled if either is present You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark patch-3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19164.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19164 commit 7852abeaa22cd15e830004c99e035d37cf2d776c Author: Andrew Ash <and...@andrewash.com> Date: 2017-09-08T07:48:54Z [SPARK-21953] Show both memory and disk bytes spilled if either is present --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r137469790 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java --- @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import java.io.Serializable; + +/** + * A read task returned by a data source reader and is responsible to create the data reader. + * The relationship between `ReadTask` and `DataReader` is similar to `Iterable` and `Iterator`. + * + * Note that, the read task will be serialized and sent to executors, then the data reader will be + * created on executors and do the actual reading. + */ +public interface ReadTask extends Serializable { + /** + * The preferred locations for this read task to run faster, but Spark can't guarantee that this + * task will always run on these locations. Implementations should make sure that it can + * be run on any location. + */ + default String[] preferredLocations() { --- End diff -- what format are these strings expected to be in? If Spark will be placing this ReadTask onto an executor that is a preferred location, the format will need to be a documented part of the API are there levels of preference, or only the binary? I'm thinking node vs rack vs datacenter for on-prem clusters, or instance vs AZ vs region for cloud clusers --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r137471056 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/upward/StatisticsSupport.java --- @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader.upward; + +/** + * A mix in interface for `DataSourceV2Reader`. Users can implement this interface to report + * statistics to Spark. + */ +public interface StatisticsSupport { --- End diff -- some datasources have per-column statistics, like how many bytes a column has or its min/max (e.g. things required for CBO). should that be a separate interface from this one? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19153: SPARK-21941 Stop storing unused attemptId in SQLT...
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/19153 SPARK-21941 Stop storing unused attemptId in SQLTaskMetrics ## What changes were proposed in this pull request? In a driver heap dump containing 390,105 instances of SQLTaskMetrics this would have saved approximately 3.2MB of memory. Since we're not getting any benefit from storing this unused value, let's eliminate it until a future PR makes use of it. ## How was this patch tested? Existing unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark aash/trim-sql-listener Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19153.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19153 commit 4490740b93d3b5d8f1f46e49a608115fe0d37a57 Author: Andrew Ash <and...@andrewash.com> Date: 2017-09-07T06:05:30Z SPARK-21941 Stop storing unused attemptId in SQLTaskMetrics In a driver heap dump containing 390,105 instances of SQLTaskMetrics this would have saved approximately 3.2MB of memory. Since we're not getting any benefit from storing this unused value, let's eliminate it until a future PR makes use of it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19088: [SPARK-21875][BUILD] Fix Java style bugs
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/19088 [SPARK-21875][BUILD] Fix Java style bugs ## What changes were proposed in this pull request? Fix Java code style so `./dev/lint-java` succeeds ## How was this patch tested? Run `./dev/lint-java` You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark spark-21875-lint-java Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19088.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19088 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18996: [MINOR][TYPO] Fix typos: runnning and Excecutors
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/18996 [MINOR][TYPO] Fix typos: runnning and Excecutors ## What changes were proposed in this pull request? Fix typos ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18996.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18996 commit cf749df92de7e1485752610a2396a14724431250 Author: Andrew Ash <and...@andrewash.com> Date: 2017-08-18T20:02:36Z [MINOR][TYPO] Fix typos: runnning and Excecutors --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18913: [SPARK-21563][CORE] Fix race condition when seria...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/18913#discussion_r132721021 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1792,6 +1796,9 @@ class SparkContext(config: SparkConf) extends Logging { /** * Adds a JAR dependency for all tasks to be executed on this `SparkContext` in the future. + * + * If a jar is added during execution, it will not be available until the next TaskSet starts. --- End diff -- This additional sentence is probably overkill, since I'm sure there are many other subtleties about these methods that aren't spelled out in the Javadoc. I'm happy to remove and keep these docs simple if that's what a committer would prefer before merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18913: [SPARK-21563][CORE] Fix race condition when seria...
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/18913 [SPARK-21563][CORE] Fix race condition when serializing TaskDescriptions and adding jars ## What changes were proposed in this pull request? Fix the race condition when serializing TaskDescriptions and adding jars by keeping the set of jars and files for a TaskSet constant across the lifetime of the TaskSet. Otherwise TaskDescription serialization can produce an invalid serialization when new file/jars are added concurrently as the TaskDescription is serialized. ## How was this patch tested? Additional unit test ensures jars/files contained in the TaskDescription remain constant throughout the lifetime of the TaskSet. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark SPARK-21563 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18913.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18913 commit e874fbeeff532c4351ba7318c750cd322c4a24f5 Author: Andrew Ash <and...@andrewash.com> Date: 2017-08-10T23:32:15Z Add test commit b06425f7267e2f0e478000c30b60b963291aacb0 Author: Andrew Ash <and...@andrewash.com> Date: 2017-08-10T23:40:49Z Fix the test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18877: [SPARK-17742][core] Handle child process exit in ...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/18877#discussion_r132583520 --- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java --- @@ -118,14 +116,40 @@ void setChildProc(Process childProc, String loggerName, InputStream logStream) { this.childProc = childProc; if (logStream != null) { this.redirector = new OutputRedirector(logStream, loggerName, -SparkLauncher.REDIRECTOR_FACTORY); +SparkLauncher.REDIRECTOR_FACTORY, this); +} else { + // If there is no log redirection, spawn a thread that will wait for the child process + // to finish. + Thread waiter = SparkLauncher.REDIRECTOR_FACTORY.newThread(this::monitorChild); + waiter.setDaemon(true); + waiter.start(); } } void setConnection(LauncherConnection connection) { this.connection = connection; } + /** + * Callback for when the child process exits. Forcefully put the application in a final state, + * overwriting the current final state unless it is already FAILED. + */ + synchronized void childProcessExited() { +disconnect(); + +int ec; +try { + ec = childProc.exitValue(); +} catch (Exception e) { --- End diff -- might want to log the exception here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18877: [SPARK-17742][core] Handle child process exit in ...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/18877#discussion_r132583553 --- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java --- @@ -166,4 +185,15 @@ private synchronized void fireEvent(boolean isInfoChanged) { } } + private void monitorChild() { +while (childProc.isAlive()) { + try { +childProc.waitFor(); + } catch (Exception e) { +// Try again. --- End diff -- log exception here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18789: SPARK-20433 Bump jackson from 2.6.5 to 2.6.7.1
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18789 @srowen sorry for not picking up on this -- thanks for pushing it over the finish line in your PR! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18789: Bump jackson from 2.6.5 to 2.6.7.1 (#241)
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/18789 Bump jackson from 2.6.5 to 2.6.7.1 (#241) This brings in a security fix for CVE-2017-7525 in the jackson-databind library, which Spark uses. When releasing this patch, upstream released a 2.6.7.1 for jackson-databind but not a corresponding 2.6.7.1 for the rest of jackson, so those only go up to 2.6.7 This requires splitting the version variable in /pom.xml You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark SPARK-20433 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18789.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18789 commit 35e69bd0066d5c8a978f71fde369bf103c8e9a5a Author: Andrew Ash <and...@andrewash.com> Date: 2017-07-27T03:58:41Z Bump jackson from 2.6.5 to 2.6.7.1 (#241) Would use 2.6.7 everywhere but upstream released a 2.6.7.1 for jackson-databind but not a corresponding 2.6.7 for the rest of jackson, so those remain on 2.6.7 This requires splitting the version variable in /pom.xml --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18658: [SPARK-20871][SQL] only log Janino code at debug level
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18658 FYI for future reviewers as well, we've been running an [extremely similar patch](https://github.com/palantir/spark/pull/181) to PJ's on our distribution of Spark for the past several months and had no problems. Without this change, a code compilation failure often escalates into a heap OOM as the too-large source code is replicated in memory through the log statement. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18658: [SPARK-20871][SQL] only log Janino code at debug ...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/18658#discussion_r127819158 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1037,24 +1037,22 @@ object CodeGenerator extends Logging { )) evaluator.setExtendedClass(classOf[GeneratedClass]) -lazy val formatted = CodeFormatter.format(code) - logDebug({ // Only add extra debugging info to byte code when we are going to print the source code. evaluator.setDebuggingInformation(true, true, false) - s"\n$formatted" + s"\n${CodeFormatter.format(code)}" --- End diff -- I'd suggest dropping the string concatenation with `\n` here -- it requires an additional copy of the code to be held in-memory and for errors where the code is too long, this causes unnecessary additional pressure on the heap --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18658: [SPARK-20871] only log Janino code at debug level
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18658 Jenkins this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18621: [SPARK-21400][SQL] Don't overwrite output committers on ...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18621 jenkins this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r126534985 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMOptions.scala --- @@ -41,11 +41,15 @@ private[libsvm] class LibSVMOptions(@transient private val parameters: CaseInsen case o => throw new IllegalArgumentException(s"Invalid value `$o` for parameter " + s"`$VECTOR_TYPE`. Expected types are `sparse` and `dense`.") } + + val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR) --- End diff -- I've seen datasets that have multi-character delimiters of more than 2 characters. Specifically `|~|` So yes there is a use case, but it's a long tail one. I'd be happy to get this progress of up to 2 characters and work towards 3+ in a future PR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18176: [SPARK-20952] Make TaskContext an InheritableTheadLocal
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18176 Jenkins this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18406: [SPARK-21195] Automatically register new metrics from so...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18406 @robert3005 looks like a bunch of tests are failing with `java.lang.IllegalArgumentException: A metric named local-1498509661743.driver.HiveExternalCatalog.fileCacheHits already exists` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18427: [SPARK-21219][scheduler] Fix race condition between addi...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18427 Jenkins this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18406: [SPARK-21195] Automatically register new metrics from so...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18406 Jenkins this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18209: [SPARK-20992][Scheduler] Add support for Nomad as a sche...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18209 @manojlds I'm a part of the Spark-on-k8s team that's currently building k8s integration for Spark outside of the Apache Spark repo. You can follow our work at https://github.com/apache-spark-on-k8s/spark Here's the current outstanding diff on top of `apache/branch-2.1`: https://github.com/apache-spark-on-k8s/spark/pull/200 , produced from about 6months of work in a varied group from approaching a dozen companies at this point. We aim to eventually bring this into the Apache repo in some form, though the timeline for that is not yet fixed. You can follow from the original design and discussion for the kubernetes efforts at https://issues.apache.org/jira/browse/SPARK-18278 One of the major pieces of feedback we received back in November was that, if possible, the kubernetes integration should be done as a plugin to the Apache Spark repo. Unfortunately that plugin point does not yet exist in Spark. So we filed https://issues.apache.org/jira/browse/SPARK-19700 to begin the process of creating that API but have not yet devoted effort to building that plugin point. Once active development on the k8s integration begin to slow down and near completion (starting to see signs of this slowing this month) we'll probably shift focus to the plugin point as a means of gaining even wider use of the k8s integration. On what a plugin point could possibly look like, we produced a thought experiment at https://github.com/palantir/spark/pull/81 which you may find interesting. As of now, it remains unclear whether the k8s code will be brought in as a first-class citizen alongside Mesos and YARN, or whether Spark will gain a plugin point and the k8s integration. We'd be happy to be a part of any wider efforts to create a plugin point for custom resource managers at SPARK-19700. Is that something the Nomad team would be interested in contributing to? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17935: [SPARK-20690][SQL] Subqueries in FROM should have alias ...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17935 @JoshRosen what was the other type of database you were using? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18176: [SPARK-20952] Make TaskContext an InheritableTheadLocal
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/18176 Jenkins this is ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17680: [SPARK-20364][SQL] Support Parquet predicate pushdown on...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17680 Are there any comments on this PR or is it ready to be merged? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17680: [SPARK-20364][SQL] Support Parquet predicate pushdown on...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17680 Any further thoughts on this? It was quite surprising for one of our users so I wanted to make sure it was fixed in a future Apache release --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112531312 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-20364: Predicate pushdown for columns with a '.' in them") { +import testImplicits._ + +Seq(true, false).foreach { vectorized => + withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) { +withTempPath { path => + Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1) +} + +withTempPath { path => + Seq(Some(1L), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 1L").count() == 1) +} + +withTempPath { path => + Seq(Some(1.0F), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 2.0").count() == 1) +} + +withTempPath { path => + Seq(Some(1.0D), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 1.0D").count() == 1) +} + +withTempPath { path => + Seq(true, false).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == true").count() == 1) +} + +withTempPath { path => + Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert( +spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1) --- End diff -- thanks for adding the additional test below --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112529697 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala --- @@ -487,6 +487,20 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi } } + test("no filter puwhdown for nested field access") { --- End diff -- nit: pushdown --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112530989 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-20364: Predicate pushdown for columns with a '.' in them") { --- End diff -- thanks for the change -- I wasn't sure if predicate pushdown worked on nested columns and it looks like that change confirms it does not after this change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112285677 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-20364: Predicate pushdown for columns with a '.' in them") { --- End diff -- is there another existing test that checks that pushdown for `struct.field1` syntax works correctly? I'm not sure how to reference those inner fields in a struct field as I don't use it much personally, but want to make sure that's not broken as a result of this change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17680#discussion_r112285883 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-20364: Predicate pushdown for columns with a '.' in them") { +import testImplicits._ + +Seq(true, false).foreach { vectorized => + withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) { +withTempPath { path => + Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1) +} + +withTempPath { path => + Seq(Some(1L), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 1L").count() == 1) +} + +withTempPath { path => + Seq(Some(1.0F), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 2.0").count() == 1) +} + +withTempPath { path => + Seq(Some(1.0D), None).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 1.0D").count() == 1) +} + +withTempPath { path => + Seq(true, false).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == true").count() == 1) +} + +withTempPath { path => + Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath) + assert( +spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1) --- End diff -- please also do the check for `IS NULL` having 1 row too, so this is symmetric --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17667: Failing test for parquet predicate pushdown on co...
Github user ash211 closed the pull request at: https://github.com/apache/spark/pull/17667 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17667: Failing test for parquet predicate pushdown on columns w...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17667 Agreed, will close for now until there's a fix to go along with the test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17667: Failing test for parquet predicate pushdown on dots with...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17667 @HyukjinKwon thanks for looking at this! Please feel free to open a Jira so we can begin discussing a fix. I haven't started working on a patch yet, only have the test case at this point. So if you come up with a solution I'd love to see it! cc @robert3005 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17667: Failing test for parquet predicate pushdown on do...
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/17667 Failing test for parquet predicate pushdown on dots with columns // checking against Jenkins to make sure this is still live on master You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark test/parquet-dots-in-column2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17667.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17667 commit 51cb89b0587d06a6911e14f5a4381bed0b6da2eb Author: Andrew Ash <and...@andrewash.com> Date: 2017-04-18T01:23:05Z Failing test for parquet predicate pushdown on dots with columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17664: Typo fix: distitrbuted -> distributed
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/17664 Typo fix: distitrbuted -> distributed ## What changes were proposed in this pull request? Typo fix: distitrbuted -> distributed ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17664.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17664 commit bc152fa1f7ed29ffa904bdf9a5ea0665be7a5bac Author: Andrew Ash <and...@andrewash.com> Date: 2017-04-17T23:48:30Z Typo fix: distitrbuted -> distributed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17401 @jerryshao ready for re-review --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17401#discussion_r108562943 --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java --- @@ -0,0 +1,123 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.yarn; + +import com.codahale.metrics.*; +import com.google.common.annotations.VisibleForTesting; +import org.apache.hadoop.metrics2.MetricsCollector; +import org.apache.hadoop.metrics2.MetricsInfo; +import org.apache.hadoop.metrics2.MetricsRecordBuilder; +import org.apache.hadoop.metrics2.MetricsSource; + +import java.util.Map; + +/** + * Modeled off of YARN's NodeManagerMetrics. + */ +public class YarnShuffleServiceMetrics implements MetricsSource { + + private final MetricSet metricSet; + + public YarnShuffleServiceMetrics(MetricSet metricSet) { +this.metricSet = metricSet; + } + + /** + * Get metrics from the source + * + * @param collector to contain the resulting metrics snapshot + * @param all if true, return all metrics even if unchanged. + */ + @Override + public void getMetrics(MetricsCollector collector, boolean all) { +MetricsRecordBuilder metricsRecordBuilder = collector.addRecord("shuffleService"); + +for (Map.Entry<String, Metric> entry : metricSet.getMetrics().entrySet()) { + collectMetric(metricsRecordBuilder, entry.getKey(), entry.getValue()); +} + } + + @VisibleForTesting + public static void collectMetric(MetricsRecordBuilder metricsRecordBuilder, String name, Metric metric) { --- End diff -- I use `static` here to make it clear that the method does not need to be run in the context of an instance. This prevents it from accidentally accessing instance variables when I don't intend it to --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17401 Ready for further review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17411: logging improvements
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17411 Thanks for the contribution @juanrh ! I'm happy to see contributions no matter how small. For larger changes you would need to file a Jira ticket, but this is small enough that it's not necessary. See http://spark.apache.org/contributing.html point 2 in the "Jira" section. Please do make the title more descriptive though -- rename to something like `[DOCS] More detailed logging around YARN executor allocation` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17401 Thanks again for the comments @jerryshao ! I've now added some tests to verify that the metrics get converted in the expected way to the collector, and camel-cased shuffleService --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17401#discussion_r107840148 --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.yarn; + +import com.codahale.metrics.*; +import org.apache.hadoop.metrics2.MetricsCollector; +import org.apache.hadoop.metrics2.MetricsInfo; +import org.apache.hadoop.metrics2.MetricsRecordBuilder; +import org.apache.hadoop.metrics2.MetricsSource; + +import java.util.Map; + +/** + * Modeled off of YARN's NodeManagerMetrics. + */ +public class YarnShuffleServiceMetrics implements MetricsSource { + + private final MetricSet metricSet; + + public YarnShuffleServiceMetrics(MetricSet metricSet) { +this.metricSet = metricSet; + } + + /** + * Get metrics from the source + * + * @param collector to contain the resulting metrics snapshot + * @param all if true, return all metrics even if unchanged. + */ + @Override + public void getMetrics(MetricsCollector collector, boolean all) { --- End diff -- should be able to, I'm working on creating one now. By correctness, I think you mostly mean that the values passed through are the same, even though the naming schemes are different? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/17401#discussion_r107840109 --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java --- @@ -166,6 +170,23 @@ protected void serviceInit(Configuration conf) throws Exception { TransportConf transportConf = new TransportConf("shuffle", new HadoopConfigProvider(conf)); blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile); + // register metrics on the block handler into the Node Manager's metrics system. + try { +YarnShuffleServiceMetrics serviceMetrics = new YarnShuffleServiceMetrics( + blockHandler.getAllMetrics()); +MetricsSystemImpl metricsSystem = (MetricsSystemImpl) DefaultMetricsSystem.instance(); + +Method registerSourceMethod = metricsSystem.getClass().getDeclaredMethod("registerSource", +String.class, String.class, MetricsSource.class); +registerSourceMethod.setAccessible(true); +registerSourceMethod.invoke(metricsSystem, "shuffleservice", "Metrics on the Spark " + --- End diff -- will change shortly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17401 Thanks for taking a look @jerryshao ! I've reformatted to two-space indentation and run `./dev/lint-java` to make sure this code passes the linter. `src/main/java/org/apache/spark/sql/streaming/GroupStateTimeout.java` still has some errors but those are separate from this change --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/17401 [SPARK-18364][YARN] Expose metrics for YarnShuffleService Registers the shuffle server's metrics with the Hadoop Node Manager's DefaultMetricsSystem. ## What changes were proposed in this pull request? Expose the shuffle service metrics not only on the ExternalShuffleService as done in SPARK-16405, but also in the YarnShuffleService. Because the YARN Node Manager operates under ## How was this patch tested? Manual deploy of new yarn-shuffle jar into a Node Manager and verifying that the metrics appear in the Node Manager-standard location. This is JMX with an query endpoint at `/jmx`. Resulting metrics look like this: ``` [user@host ~]$ curl -sk -XGET https://`hostname -f`:8042/jmx | jq . | grep 'shuffleservice' -B 1 -A 18 { "name": "Hadoop:service=NodeManager,name=shuffleservice", "modelerType": "shuffleservice", "tag.Hostname": "", "openBlockRequestLatencyMillis_count": 1, "openBlockRequestLatencyMillis_rate15": 0.0011080303990206543, "openBlockRequestLatencyMillis_rate5": 0.0033057092356765017, "openBlockRequestLatencyMillis_rate1": 0.015991117074135343, "openBlockRequestLatencyMillis_rateMean": 0.003843993699021382, "blockTransferRateBytes_count": 118, "blockTransferRateBytes_rate15": 0.1307475870844372, "blockTransferRateBytes_rate5": 0.39007368980982715, "blockTransferRateBytes_rate1": 1.8869518147479705, "blockTransferRateBytes_rateMean": 0.45359183094454836, "registeredExecutorsSize": 2, "registerExecutorRequestLatencyMillis_count": 2, "registerExecutorRequestLatencyMillis_rate15": 0.001697343764758814, "registerExecutorRequestLatencyMillis_rate5": 0.002970701813078509, "registerExecutorRequestLatencyMillis_rate1": 0.0005857750515146702, "registerExecutorRequestLatencyMillis_rateMean": 0.007687995987242345 }, [user@host ~]$ ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark spark-18364 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17401.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17401 commit 4caf4d28819d82240f1d61e11cdabd68fafad14a Author: Andrew Ash <and...@andrewash.com> Date: 2017-03-23T02:59:38Z [SPARK-18364][YARN] Expose metrics for YarnShuffleService Registers the shuffle server's metrics with the Hadoop Node Manager's DefaultMetricsSystem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17399: Update functions.scala
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/17399 Thanks for contributing to Spark @roxannemoslehi ! I think Sean just means updating the title to something more like `[DOCS] Clarify round mode in format_number function`. It doesn't feel big enough to be worth a Jira ticket for such a small docs change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/14615 Jenkins, this is ok to test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/14615 @robert3005 looks like this has unit test failures on `org.apache.spark.sql.hive.orc.OrcSourceSuite.SPARK-19459/SPARK-18220: read char/varchar column written by Hive` -- is that a flake? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/14615 Jenkins, this is ok to test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16959: [SPARK-19631][CORE] OutputCommitCoordinator should not a...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16959 Any last changes before merging? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16959: [SPARK-19631][CORE] OutputCommitCoordinator shoul...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/16959#discussion_r103576994 --- Diff: core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala --- @@ -195,6 +195,17 @@ class OutputCommitCoordinatorSuite extends SparkFunSuite with BeforeAndAfter { sc.runJob(rdd, OutputCommitFunctions(tempDir.getAbsolutePath).callCanCommitMultipleTimes _, 0 until rdd.partitions.size) } + + test("SPARK-19631: Do not allow failed attempts to be authorized for committing") { +val stage: Int = 1 +val partition: Int = 1 +val failedAttempt: Int = 0 +outputCommitCoordinator.stageStart(stage, maxPartitionId = 1) +outputCommitCoordinator.taskCompleted(stage, partition, attemptNumber = failedAttempt, + reason = ExecutorLostFailure("0", exitCausedByApp = true, None)) +assert(!outputCommitCoordinator.canCommit(stage, partition, failedAttempt)) --- End diff -- My understanding is that this change doesn't affect any task that's mid-commit, it only changes the pre-commit portion of a task attempt's lifecycle to prevent failed tasks (=> heartbeat timeouts as reported by the DAGScheduler) from receiving authorization to commit. Is there a new way which two task attempts can be authorized simultaneously that didn't exist before? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16959: [SPARK-19631][CORE] OutputCommitCoordinator should not a...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16959 @vanzin are you right person to review this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16959: [SPARK-19631][CORE] OutputCommitCoordinator shoul...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/16959#discussion_r101762924 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -48,25 +48,28 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean) private type StageId = Int private type PartitionId = Int private type TaskAttemptNumber = Int + private case class StageState(authorizedCommitters: Array[TaskAttemptNumber], --- End diff -- can you put a comment here that the index into the `authorizedCommitters` array is the partitionId ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16575: [SPARK-19213] DatasourceScanExec uses runtime sparksessi...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16575 @hvanhovell does that description make sense? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16503: [SPARK-18113] Use ask to replace askWithRetry in canComm...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16503 Making this idempotent looks great. I think there's a separate issue with this code still not handling poorly-timed preemption, but let's deal with that in a separate ticket / PR. Good work so far @jinxing64 ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16575: [SPARK-19213] DatasourceScanExec uses runtime sparksessi...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16575 @hvanhovell the need is also explained in the Jira ticket: https://issues.apache.org/jira/browse/SPARK-19213 Does that code snippet make sense? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16503: [SPARK-18113] Use ask to replace askWithRetry in canComm...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16503 @jinxing64 can you please fix the failing Scala style tests? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16503: [SPARK-18113] Use ask to replace askWithRetry in canComm...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16503 You covered my concerns! I think this will fix some parts of this problem for sure, not sure if it covers every possible case though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16503: [SPARK-18113] Use ask to replace askWithRetry in ...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/16503#discussion_r95660339 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -165,9 +167,14 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean) authorizedCommitters(partition) = attemptNumber true case existingCommitter => -logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " + - s"partition=$partition; existingCommitter = $existingCommitter") -false +// Coordinator should be idempotent when receiving AskPermissionToCommit. +if (existingCommitter == attemptNumber) { + true --- End diff -- please log a warning here before returning `true` -- reaching this branch is likely indicative of network problems ``` logWarning(s"Authorizing duplicate request to commit for " + s"attemptNumber=$attemptNumber to commit for stage=$stage, partition=$partition; " + s"existingCommitter = $existingCommitter. This can indicate dropped network traffic.") ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16503: [SPARK-18113] Use ask to replace askWithRetry in ...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/16503#discussion_r95659921 --- Diff: core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala --- @@ -221,6 +227,22 @@ private case class OutputCommitFunctions(tempDirPath: String) { if (ctx.attemptNumber == 0) failingOutputCommitter else successfulOutputCommitter) } + // Receiver should be idempotent for AskPermissionToCommitOutput + def callCanCommitMultiTimes(iter: Iterator[Int]): Unit = { --- End diff -- nit rename to `callCanCommitMultipleTimes` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16503: [SPARK-18113] Use ask to replace askWithRetry in ...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/16503#discussion_r95659396 --- Diff: core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala --- @@ -189,6 +188,13 @@ class OutputCommitCoordinatorSuite extends SparkFunSuite with BeforeAndAfter { assert( !outputCommitCoordinator.canCommit(stage, partition, nonAuthorizedCommitter + 3)) } + + test("Authoried commiter get true if it calls canCommit again.") { --- End diff -- Reword test description to: `Duplicate calls to canCommit from the authorized committer gets idempotent responses` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16558: Fix missing close-parens for In filter's toString
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/16558 Fix missing close-parens for In filter's toString Otherwise the open parentheses isn't closed in query plan descriptions of batch scans. PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ... You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark patch-9 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16558.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16558 commit a3d737b87eef1ca79a69c4714073edab8ffaadaa Author: Andrew Ash <and...@andrewash.com> Date: 2017-01-12T07:53:40Z Fix missing close-parens for In filter's toString Otherwise the open parentheses isn't closed in query plan descriptions of batch scans. PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16281 What are the specific patches to parquet that folks are proposing should be included in a parquet 1.8.1-spark1 ? Or what would be desired in a parquet-released 1.8.2 ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16092: [SPARK-18662] Move resource managers to separate directo...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16092 With a k8s backend on the way I do think it adds a nice organization for these 3 clearly grouped modules --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16061: [SPARK-18278] [Scheduler] Support native submission of s...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/16061 Another external scheduler backend I'm aware of is Two Sigma's scheduler backend for the system they've created called [Cook](https://github.com/twosigma/Cook). See [CoarseCookSchedulerBackend.scala](https://github.com/twosigma/spark/blob/2.0.2-cook/core/src/main/scala/org/apache/spark/scheduler/CoarseCookSchedulerBackend.scala) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15932: [SPARK-18448][CORE] SparkSession should implement java.l...
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/15932 Yep that's precisely what I was envisioning. Thanks @srowen ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/15835#discussion_r87703118 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -703,6 +705,81 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-17059: Allow FileFormat to specify partition pruning strategy") { +withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") { + withTempPath { path => +spark.sparkContext.parallelize(Seq(1, 2, 3), 3) + .toDF("x").write.parquet(path.getCanonicalPath) + +val zeroPartitions = spark.read.parquet(path.getCanonicalPath).where("x = 0") +assert(zeroPartitions.rdd.partitions.length == 0) + +val onePartition = spark.read.parquet(path.getCanonicalPath).where("x = 1") +assert(onePartition.rdd.partitions.length == 1) + } +} + } + + test("Do not filter out parquet file when missing in _metadata file") { +withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") { + withTempPath { path => +spark.sparkContext.parallelize(Seq(1, 2, 3), 3) + .toDF("x").write.parquet(path.getCanonicalPath) +spark.sparkContext.parallelize(Seq(4)) + .toDF("x").write.mode(SaveMode.Append).parquet(path.getCanonicalPath) --- End diff -- do you need to turn off `ParquetOutputFormat.ENABLE_JOB_SUMMARY` to prevent the `_metadata` file from being updated on the second write, or is it normally not updated in `SaveMode.Append` ? I would've expected the second write here to update the `_metadata` file in which case the test name wouldn't match the behavior. But if the second write _doesn't_ update `_metadata` then the file has 3 out of 4 file partitions in it, matching the test name. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15486: Typo: form -> from
GitHub user ash211 opened a pull request: https://github.com/apache/spark/pull/15486 Typo: form -> from ## What changes were proposed in this pull request? Minor typo fix ## How was this patch tested? Existing unit tests on Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/ash211/spark patch-8 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15486.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15486 commit be8b0b5f39be4389a28768dfa2756da11f87937f Author: Andrew Ash <and...@andrewash.com> Date: 2016-10-14T15:09:26Z Typo: form -> from --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org