[GitHub] AmplabJenkins removed a comment on issue #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl…
AmplabJenkins removed a comment on issue #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl… URL: https://github.com/apache/spark/pull/23281#issuecomment-446053958 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl…
AmplabJenkins commented on issue #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl… URL: https://github.com/apache/spark/pull/23281#issuecomment-446053958 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl…
AmplabJenkins commented on issue #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl… URL: https://github.com/apache/spark/pull/23281#issuecomment-446054026 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] lcqzte10192193 opened a new pull request #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl…
lcqzte10192193 opened a new pull request #23281: [MINOR][DOC]update the condition description of BypassMergeSortShuffl… URL: https://github.com/apache/spark/pull/23281 …eWriter ## What changes were proposed in this pull request? These three condition descriptions should be updated, following #23228: no Ordering is specified, no Aggregator is specified, and the number of partitions is less than spark.shuffle.sort.bypassMergeThreshold. 1. If the shuffle dependency specifies aggregation, but it only aggregates on the reduce side, serialized shuffle can still be used. 2. If the number of output partitions is 16777216, we can use serialized shuffle. ## How was this patch tested? N/A This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
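For readers without the code at hand, a minimal Scala sketch of the decision those three conditions describe follows. It approximates SortShuffleWriter.shouldBypassMergeSort; the exact checks (for instance whether the aggregator or the map-side-combine flag is inspected) differ across Spark versions, so treat it as an illustration rather than the current implementation.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf}

// Sketch only: the bypass-merge-sort path is chosen when there is no map-side
// aggregation to perform and the partition count stays at or below the
// configured threshold (default 200).
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  if (dep.mapSideCombine) {
    // Map-side combine requires sorting and merging, so the bypass path is out.
    false
  } else {
    val bypassMergeThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}
```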
[GitHub] mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes.
mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240456173 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit = { Review comment: Another factor of my concern about is that for each individual assertion, it is unclear which step the assertion is tied to. 
This reads a lot more like an end-to-end (E2E) test than a unit test. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes.
mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240454265 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit = { +val metadata = pod.pod.getMetadata +assert(metadata.getLabels.containsKey("test-label-key")) +assert(metadata.getAnnotations.containsKey("test-annotation-key")) 
+assert(metadata.getNamespace === "namespace") +assert(metadata.getOwnerReferences.asScala.exists(_.getName == "owner-reference")) +val spec = pod.pod.getSpec +assert(!spec.getContainers.asScala.exists(_.getName == "executor-container")) +assert(spec.getDnsPolicy === "dns-policy") +assert(spec.getHostAliases.asScala.exists(_.getHostnames.asScala.exists(_ == "hostname"))) Review comment: Nit: The second call to `exists` can be `contains` instead, so that we don't need to pass a function object. Alternatively, both `exists` calls can be removed: ``` spec.getHostAliases.asScala.flatMap(_.getHostnames.asScala).contains("hostname") ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes.
mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240454417 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit = { +val metadata = pod.pod.getMetadata +assert(metadata.getLabels.containsKey("test-label-key")) +assert(metadata.getAnnotations.containsKey("test-annotation-key")) 
+assert(metadata.getNamespace === "namespace") +assert(metadata.getOwnerReferences.asScala.exists(_.getName == "owner-reference")) +val spec = pod.pod.getSpec +assert(!spec.getContainers.asScala.exists(_.getName == "executor-container")) Review comment: Nit: Use `.asScala.map(_.getName).contains("executor-container")`. Similar changes come up throughout this test. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes.
mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240454863 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit = { Review comment: The number of things this test checks is remarkable, and it is very much possible to accidentally omit checking the application of a specific feature when a new one is 
added for either the driver or executor. This is why we had the overridable feature steps in the original incarnation of these tests. Not mocking the substeps leads us to need to check that some specific aspect of each step has been applied. Can we go back to mocking the different steps so that this test can be more easily modified when we add more features? Or else can we abstract away the idea that these steps are applied without this test itself needing to know what the step itself actually does? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
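To make the suggestion concrete, a rough Scala sketch of step-level mocking follows. It assumes the builder can be constructed with an injected sequence of KubernetesFeatureConfigStep instances, which is not how the code under review is wired, so read it as an illustration of the idea rather than a drop-in test.

```scala
import org.mockito.Mockito.{mock, verify, when}

import org.apache.spark.deploy.k8s.SparkPod
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Mock a single feature step and assert only that the builder applied it,
// instead of asserting on every field the real step would have set.
val step = mock(classOf[KubernetesFeatureConfigStep])
val initialPod = SparkPod.initialPod()
when(step.configurePod(initialPod)).thenReturn(initialPod)

// ... run the builder with `step` injected (hypothetical hook) ...

verify(step).configurePod(initialPod)
```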
[GitHub] mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes.
mccheah commented on a change in pull request #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#discussion_r240454265 ## File path: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/PodBuilderSuite.scala ## @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import io.fabric8.kubernetes.api.model.{Config => _, _} +import io.fabric8.kubernetes.client.KubernetesClient +import io.fabric8.kubernetes.client.dsl.{MixedOperation, PodResource} +import org.mockito.Matchers.any +import org.mockito.Mockito.{mock, never, verify, when} +import scala.collection.JavaConverters._ + +import org.apache.spark.{SparkConf, SparkException, SparkFunSuite} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.internal.config.ConfigEntry + +abstract class PodBuilderSuite extends SparkFunSuite { + + protected def templateFileConf: ConfigEntry[_] + + protected def buildPod(sparkConf: SparkConf, client: KubernetesClient): SparkPod + + private val baseConf = new SparkConf(false) +.set(Config.CONTAINER_IMAGE, "spark-executor:latest") + + test("use empty initial pod if template is not specified") { +val client = mock(classOf[KubernetesClient]) +buildPod(baseConf.clone(), client) +verify(client, never()).pods() + } + + test("load pod template if specified") { +val client = mockKubernetesClient() +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val pod = buildPod(sparkConf, client) +verifyPod(pod) + } + + test("complain about misconfigured pod template") { +val client = mockKubernetesClient( + new PodBuilder() +.withNewMetadata() +.addToLabels("test-label-key", "test-label-value") +.endMetadata() +.build()) +val sparkConf = baseConf.clone().set(templateFileConf.key, "template-file.yaml") +val exception = intercept[SparkException] { + buildPod(sparkConf, client) +} +assert(exception.getMessage.contains("Could not load pod from template file.")) + } + + private def mockKubernetesClient(pod: Pod = podWithSupportedFeatures()): KubernetesClient = { +val kubernetesClient = mock(classOf[KubernetesClient]) +val pods = + mock(classOf[MixedOperation[Pod, PodList, DoneablePod, PodResource[Pod, DoneablePod]]]) +val podResource = mock(classOf[PodResource[Pod, DoneablePod]]) +when(kubernetesClient.pods()).thenReturn(pods) +when(pods.load(any(classOf[File]))).thenReturn(podResource) +when(podResource.get()).thenReturn(pod) +kubernetesClient + } + + private def verifyPod(pod: SparkPod): Unit = { +val metadata = pod.pod.getMetadata +assert(metadata.getLabels.containsKey("test-label-key")) +assert(metadata.getAnnotations.containsKey("test-annotation-key")) 
+assert(metadata.getNamespace === "namespace") +assert(metadata.getOwnerReferences.asScala.exists(_.getName == "owner-reference")) +val spec = pod.pod.getSpec +assert(!spec.getContainers.asScala.exists(_.getName == "executor-container")) +assert(spec.getDnsPolicy === "dns-policy") +assert(spec.getHostAliases.asScala.exists(_.getHostnames.asScala.exists(_ == "hostname"))) Review comment: The second call to `exists` can be `contains` instead, so that we don't need to pass a function object. Alternatively, both `exists` calls can be removed: ``` spec.getHostAliases.asScala.flatMap(_.getHostnames.asScala).contains("hostname") ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23280: [MINOR][SQL] Some errors in the notes.
AmplabJenkins removed a comment on issue #23280: [MINOR][SQL] Some errors in the notes. URL: https://github.com/apache/spark/pull/23280#issuecomment-446051121 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23280: [MINOR][SQL] Some errors in the notes.
AmplabJenkins commented on issue #23280: [MINOR][SQL] Some errors in the notes. URL: https://github.com/apache/spark/pull/23280#issuecomment-446051398 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23280: [MINOR][SQL] Some errors in the notes.
AmplabJenkins commented on issue #23280: [MINOR][SQL] Some errors in the notes. URL: https://github.com/apache/spark/pull/23280#issuecomment-446051121 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23280: [MINOR][SQL] Some errors in the notes.
AmplabJenkins removed a comment on issue #23280: [MINOR][SQL] Some errors in the notes. URL: https://github.com/apache/spark/pull/23280#issuecomment-446051038 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23280: [MINOR][SQL] Some errors in the notes.
AmplabJenkins commented on issue #23280: [MINOR][SQL] Some errors in the notes. URL: https://github.com/apache/spark/pull/23280#issuecomment-446051038 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HyukjinKwon commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
HyukjinKwon commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240455039 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala ## @@ -246,13 +246,56 @@ private[yarn] class ExecutorRunnable( sys.env.get("SPARK_USER").foreach { user => val containerId = ConverterUtils.toString(c.getId) val address = c.getNodeHttpAddress -val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" -env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" -env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" +sparkConf.get(config.CUSTOM_LOG_URL) match { + case Some(customUrl) => +val pathVariables = ExecutorRunnable.buildPathVariables(httpScheme, address, + YarnConfiguration.getClusterId(conf), containerId, user) +val envNameToFileNameMap = Map("SPARK_LOG_URL_STDERR" -> "stderr", + "SPARK_LOG_URL_STDOUT" -> "stdout") +val logUrls = ExecutorRunnable.replaceLogUrls(customUrl, pathVariables, + envNameToFileNameMap) + +logUrls.foreach { case (envName, url) => + env(envName) = url +} + case None => +val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" +env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" +env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" + } } } env } } + +private[yarn] object ExecutorRunnable { + val LOG_URL_PATTERN_HTTP_SCHEME = "{{HttpScheme}}" + val LOG_URL_PATTERN_NODE_HTTP_ADDRESS = "{{NodeHttpAddress}}" + val LOG_URL_PATTERN_CLUSTER_ID = "{{ClusterId}}" + val LOG_URL_PATTERN_CONTAINER_ID = "{{ContainerId}}" + val LOG_URL_PATTERN_USER = "{{User}}" + val LOG_URL_PATTERN_FILE_NAME = "{{FileName}}" + + def buildPathVariables(httpScheme: String, nodeHttpAddress: String, clusterId: String, Review comment: Strictly speaking, the guideline itself does not explicitly mention two lines - it says it's okay if it fits within 2 lines (see https://github.com/databricks/scala-style-guide/pull/64#issuecomment-344609090). I intentionally avoided this because some code in some components does not comply with this. However, strictly we should stick to two-space indentation whenever possible per https://github.com/databricks/scala-style-guide#indent > Use 2-space indentation in general. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] CarolinePeng opened a new pull request #23280: [MINOR][SQL] Some errors in the notes.
CarolinePeng opened a new pull request #23280: [MINOR][SQL] Some errors in the notes. URL: https://github.com/apache/spark/pull/23280 ## What changes were proposed in this pull request? When using ordinals to access a linked list, the time cost is O(n). ## How was this patch tested? Existing tests. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
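A small, self-contained Scala illustration of the O(n) point (not taken from the PR itself): indexed access on java.util.LinkedList walks the list node by node, while java.util.ArrayList resolves the ordinal directly.

```scala
import java.util.{ArrayList, LinkedList}

// Build both list types with the same contents.
val linked = new LinkedList[Int]()
val array = new ArrayList[Int]()
(0 until 100000).foreach { i => linked.add(i); array.add(i) }

val mid = linked.size / 2
linked.get(mid) // O(n): traverses roughly n/2 nodes to reach the ordinal
array.get(mid)  // O(1): direct index into the backing array
```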
[GitHub] HyukjinKwon commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
HyukjinKwon commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240455039 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala ## @@ -246,13 +246,56 @@ private[yarn] class ExecutorRunnable( sys.env.get("SPARK_USER").foreach { user => val containerId = ConverterUtils.toString(c.getId) val address = c.getNodeHttpAddress -val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" -env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" -env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" +sparkConf.get(config.CUSTOM_LOG_URL) match { + case Some(customUrl) => +val pathVariables = ExecutorRunnable.buildPathVariables(httpScheme, address, + YarnConfiguration.getClusterId(conf), containerId, user) +val envNameToFileNameMap = Map("SPARK_LOG_URL_STDERR" -> "stderr", + "SPARK_LOG_URL_STDOUT" -> "stdout") +val logUrls = ExecutorRunnable.replaceLogUrls(customUrl, pathVariables, + envNameToFileNameMap) + +logUrls.foreach { case (envName, url) => + env(envName) = url +} + case None => +val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" +env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" +env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" + } } } env } } + +private[yarn] object ExecutorRunnable { + val LOG_URL_PATTERN_HTTP_SCHEME = "{{HttpScheme}}" + val LOG_URL_PATTERN_NODE_HTTP_ADDRESS = "{{NodeHttpAddress}}" + val LOG_URL_PATTERN_CLUSTER_ID = "{{ClusterId}}" + val LOG_URL_PATTERN_CONTAINER_ID = "{{ContainerId}}" + val LOG_URL_PATTERN_USER = "{{User}}" + val LOG_URL_PATTERN_FILE_NAME = "{{FileName}}" + + def buildPathVariables(httpScheme: String, nodeHttpAddress: String, clusterId: String, Review comment: Strictly speaking, the guideline itself does not explicitly mention two lines - it says it's okay if it fits within 2 lines (see https://github.com/databricks/scala-style-guide/pull/64#issuecomment-344609090). I intentionally avoided this because some code in some components does not comply with this. However, strictly we should stick to two-space indentation whenever possible per https://github.com/databricks/scala-style-guide#indent > Use 2-space indentation in general. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
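The diff above only declares the {{...}} pattern variables; the substitution helper itself is not shown in this excerpt. As a rough illustration of what replaceLogUrls presumably does (the real signature and behavior may differ), a simple Scala sketch:

```scala
// Assumed implementation, not taken from the PR: substitute every {{Variable}}
// occurrence in the configured pattern, producing one URL per log file.
def replaceLogUrls(
    urlPattern: String,
    pathVariables: Map[String, String],
    envNameToFileName: Map[String, String]): Map[String, String] = {
  envNameToFileName.map { case (envName, fileName) =>
    val allVariables = pathVariables + ("{{FileName}}" -> fileName)
    val url = allVariables.foldLeft(urlPattern) {
      case (current, (variable, value)) => current.replace(variable, value)
    }
    envName -> url
  }
}
```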
[GitHub] SparkQA commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
SparkQA commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446048530 **[Test build #99942 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99942/testReport)** for PR 23213 at commit [`a9c108f`](https://github.com/apache/spark/commit/a9c108fa090b847d48848cf6d679aa6747dcc534). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
AmplabJenkins removed a comment on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446048394 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5947/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
AmplabJenkins commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446048394 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5947/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23272: [SPARK-26265][Core] Fix deadlock in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator and TaskMemoryManager
cloud-fan commented on a change in pull request #23272: [SPARK-26265][Core] Fix deadlock in BytesToBytesMap.MapIterator when locking both BytesToBytesMap.MapIterator and TaskMemoryManager URL: https://github.com/apache/spark/pull/23272#discussion_r240453178 ## File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ## @@ -283,6 +290,9 @@ private void advanceToNextPage() { } } } + if (pageToFree != null) { +freePage(pageToFree); Review comment: is it possible that this page is already freed by another consumer? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
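For context on the question, the pattern the patch appears to follow is to pick the page to free while holding the iterator's lock but to call freePage only after leaving the locked region, so the MapIterator lock and the TaskMemoryManager lock are never held at the same time. A self-contained Scala sketch of that pattern with invented names (it is not the actual BytesToBytesMap code, and the reviewer's double-free concern is exactly what such code must still guard against):

```scala
// Schematic: decide what to free under lockA, perform the free (which takes
// lockB) only after lockA is released, avoiding lock-order inversion.
object DeferredFreeSketch {
  private val lockA = new Object                 // stands in for the MapIterator lock
  private val lockB = new Object                 // stands in for the TaskMemoryManager lock
  private var currentPage: Array[Byte] = new Array[Byte](16)

  private def freePage(page: Array[Byte]): Unit = lockB.synchronized {
    // bookkeeping that releases the page
  }

  def advanceToNextPage(): Unit = {
    var pageToFree: Array[Byte] = null
    lockA.synchronized {
      pageToFree = currentPage                   // remember instead of freeing under lockA
      currentPage = new Array[Byte](16)
    }
    if (pageToFree != null) {
      freePage(pageToFree)                       // acquires lockB without holding lockA
    }
  }
}
```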
[GitHub] AmplabJenkins removed a comment on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
AmplabJenkins removed a comment on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446048389 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
AmplabJenkins commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446048389 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446047256 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
cloud-fan commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446047462 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446047258 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99935/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446047258 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99935/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446047256 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
SparkQA removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-445985434 **[Test build #99935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99935/testReport)** for PR 23252 at commit [`eade6e2`](https://github.com/apache/spark/commit/eade6e2db84ebab2ede5a375a1a2c1303fefc1fe). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
SparkQA commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446046903 **[Test build #99935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99935/testReport)** for PR 23252 at commit [`eade6e2`](https://github.com/apache/spark/commit/eade6e2db84ebab2ede5a375a1a2c1303fefc1fe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23275: [SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any
cloud-fan commented on a change in pull request #23275: [SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any URL: https://github.com/apache/spark/pull/23275#discussion_r240451835 ## File path: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala ## @@ -88,68 +88,49 @@ sealed trait UserDefinedFunction { private[sql] case class SparkUserDefinedFunction( f: AnyRef, dataType: DataType, -inputTypes: Option[Seq[DataType]], -nullableTypes: Option[Seq[Boolean]], +inputSchemas: Seq[Option[ScalaReflection.Schema]], name: Option[String] = None, nullable: Boolean = true, deterministic: Boolean = true) extends UserDefinedFunction { @scala.annotation.varargs - override def apply(exprs: Column*): Column = { -// TODO: make sure this class is only instantiated through `SparkUserDefinedFunction.create()` -// and `nullableTypes` is always set. -if (inputTypes.isDefined) { - assert(inputTypes.get.length == nullableTypes.get.length) -} - -val inputsNullSafe = nullableTypes.getOrElse { - ScalaReflection.getParameterTypeNullability(f) -} + override def apply(cols: Column*): Column = { +Column(createScalaUDF(cols.map(_.expr))) + } -Column(ScalaUDF( + private[sql] def createScalaUDF(exprs: Seq[Expression]): ScalaUDF = { +// It's possible that some of the inputs don't have a specific type(e.g. `Any`), skip type +// check and null check for them. +val inputTypes = inputSchemas.map(_.map(_.dataType).getOrElse(AnyDataType)) +val inputsNullSafe = inputSchemas.map(_.map(_.nullable).getOrElse(true)) Review comment: Here `getOrElse` maybe better, as it matches the previous line. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] yaooqinn closed pull request #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on executor side
yaooqinn closed pull request #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on executor side URL: https://github.com/apache/spark/pull/19840 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala index f524de68fbce0..9d16dbbe9a0c8 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala @@ -57,9 +57,14 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs") + private val env = SparkEnv.get + private val conf = env.conf + // All the Python functions should have the same exec, version and envvars. protected val envVars = funcs.head.funcs.head.envVars - protected val pythonExec = funcs.head.funcs.head.pythonExec + protected val pythonExec = conf.getOption("spark.executorEnv.PYSPARK_DRIVER_PYTHON") +.getOrElse(conf.getOption("spark.executorEnv.PYSPARK_PYTHON") + .getOrElse(funcs.head.funcs.head.pythonExec)) protected val pythonVer = funcs.head.funcs.head.pythonVer // TODO: support accumulator in multiple UDF @@ -70,7 +75,6 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( partitionIndex: Int, context: TaskContext): Iterator[OUT] = { val startTime = System.currentTimeMillis -val env = SparkEnv.get val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",") envVars.put("SPARK_LOCAL_DIRS", localdir) // it's also used in monitor thread if (reuseWorker) { This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
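For reference, the (closed) patch above reads the executor-side Python executable from Spark configuration before falling back to the exec serialized with the Python function. A hedged example of how those keys would be set from user code (the interpreter paths are placeholders):

```scala
import org.apache.spark.SparkConf

// Placeholder paths; the configuration keys are the ones the patch reads
// inside BasePythonRunner on the executor.
val conf = new SparkConf()
  .set("spark.executorEnv.PYSPARK_PYTHON", "/opt/conda/bin/python")
  .set("spark.executorEnv.PYSPARK_DRIVER_PYTHON", "/opt/conda/bin/python")
```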
[GitHub] cloud-fan commented on a change in pull request #23275: [SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any
cloud-fan commented on a change in pull request #23275: [SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any URL: https://github.com/apache/spark/pull/23275#discussion_r240449840 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ## @@ -47,25 +47,13 @@ case class ScalaUDF( function: AnyRef, dataType: DataType, children: Seq[Expression], -inputsNullSafe: Seq[Boolean], -inputTypes: Seq[DataType] = Nil, +@transient inputsNullSafe: Seq[Boolean], Review comment: Expressions are usually serialized to the executor side. Previously this was fine, as all the data types were case classes, which are serializable. But `AnyDataType` is a plain Scala object, which is not serializable. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
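As a generic illustration of the @transient point (not Spark's actual classes): marking a member @transient keeps a non-serializable value out of the bytes shipped to executors, and pairing it with lazy val lets it be re-created after deserialization.

```scala
// Stand-in for a non-serializable object such as a plain data-type singleton.
class NonSerializableHelper {
  def describe(): String = "recomputed in each JVM"
}

class Wrapper(val value: Int) extends Serializable {
  // Excluded from serialization; lazily re-created where first used,
  // e.g. after the enclosing object is deserialized on an executor.
  @transient lazy val helper: NonSerializableHelper = new NonSerializableHelper
}
```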
[GitHub] asfgit closed pull request #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
asfgit closed pull request #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala index e214bfd050410..49fb288fdea6a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala @@ -18,54 +18,14 @@ package org.apache.spark.sql.execution import org.apache.spark.rdd.RDD -import org.apache.spark.sql.{Encoder, Row, SparkSession} -import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow} +import org.apache.spark.sql.{Encoder, SparkSession} +import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, UnknownPartitioning} import org.apache.spark.sql.catalyst.util.truncatedString import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.DataType - -object RDDConversions { - def productToRowRdd[A <: Product](data: RDD[A], outputTypes: Seq[DataType]): RDD[InternalRow] = { -data.mapPartitions { iterator => - val numColumns = outputTypes.length - val mutableRow = new GenericInternalRow(numColumns) - val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter) - iterator.map { r => -var i = 0 -while (i < numColumns) { - mutableRow(i) = converters(i)(r.productElement(i)) - i += 1 -} - -mutableRow - } -} - } - - /** - * Convert the objects inside Row into the types Catalyst expected. 
- */ - def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): RDD[InternalRow] = { -data.mapPartitions { iterator => - val numColumns = outputTypes.length - val mutableRow = new GenericInternalRow(numColumns) - val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter) - iterator.map { r => -var i = 0 -while (i < numColumns) { - mutableRow(i) = converters(i)(r(i)) - i += 1 -} - -mutableRow - } -} - } -} object ExternalRDD { diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index c6000442fae76..b304e2da6e1cf 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -29,11 +29,11 @@ import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, Quali import org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToScala import org.apache.spark.sql.catalyst.analysis._ import org.apache.spark.sql.catalyst.catalog._ +import org.apache.spark.sql.catalyst.encoders.RowEncoder import org.apache.spark.sql.catalyst.expressions import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.planning.PhysicalOperation import org.apache.spark.sql.catalyst.plans.logical.{InsertIntoDir, InsertIntoTable, LogicalPlan, Project} -import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution.{RowDataSourceScanExec, SparkPlan} import org.apache.spark.sql.execution.command._ @@ -416,7 +416,10 @@ case class DataSourceStrategy(conf: SQLConf) extends Strategy with Logging with output: Seq[Attribute], rdd: RDD[Row]): RDD[InternalRow] = { if (relation.relation.needConversion) { - execution.RDDConversions.rowToRowRdd(rdd, output.map(_.dataType)) + val converters = RowEncoder(StructType.fromAttributes(output)) + rdd.mapPartitions { iterator => +iterator.map(converters.toRow) + } } else { rdd.asInstanceOf[RDD[InternalRow]] } This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
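A standalone illustration of the RowEncoder-based conversion the merged patch adopts (the schema here is assumed for the example; in the patch it is derived from the relation's output attributes):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Build an encoder for the external Row schema.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))
val encoder = RowEncoder(schema)

// Convert an external Row into Catalyst's InternalRow representation,
// mirroring iterator.map(converters.toRow) in the patch.
val internalRow = encoder.toRow(Row(1, "spark"))
```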
[GitHub] HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262#issuecomment-446042871 Weird. My remote is as below FWIW. ``` $ git remote -v apache https://gitbox.apache.org/repos/asf/spark.git (fetch) apache https://gitbox.apache.org/repos/asf/spark.git (push) ... ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries
cloud-fan commented on issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries URL: https://github.com/apache/spark/pull/23211#issuecomment-446042787 Yes, since the `PushdownLeftSemiOrAntiJoin` rule is useful even without subqueries. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262#issuecomment-446042637 Oops, it looks to be working fine for me after switching to gitbox. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23208: [SPARK-25530][SQL] data source v2 API refactor (batch write)
cloud-fan commented on a change in pull request #23208: [SPARK-25530][SQL] data source v2 API refactor (batch write) URL: https://github.com/apache/spark/pull/23208#discussion_r240448233
## File path: sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java ##
@@ -25,7 +25,10 @@
  * The base interface for v2 data sources which don't have a real catalog. Implementations must
  * have a public, 0-arg constructor.
  *
- * The major responsibility of this interface is to return a {@link Table} for read/write.
+ * The major responsibility of this interface is to return a {@link Table} for read/write. If you
+ * want to allow end-users to write data to non-existing tables via write APIs in `DataFrameWriter`
+ * with `SaveMode`, you must return a {@link Table} instance even if the table doesn't exist. The
+ * table schema can be empty in this case.

Review comment: What about the file source behavior difference between `SaveMode.Append` and the new append operator? Are you saying we should accept it and ask users to change their code? The file source is widely used with the `df.write.save` API... This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
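For context on the behavior being debated, here is a small sketch of the existing v1-style call path the comment refers to; the output path and session setup are illustrative only.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("save-mode-example").getOrCreate()
val df = spark.range(10).toDF("id")

// With a path-based file source, SaveMode.Append creates the target location if it
// does not exist and appends to it otherwise; the question above is whether routing
// this call through the new v2 append operator may change that behavior.
df.write.mode(SaveMode.Append).parquet("/tmp/example-output")
```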
[GitHub] cloud-fan commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
cloud-fan commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262#issuecomment-446041476 Yeah, it didn't work; the PR is not merged. I'll try again later. cc @srowen do you hit the same issue? I have already switched to gitbox. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
cloud-fan commented on a change in pull request #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#discussion_r240447269 ## File path: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java ## @@ -20,14 +20,27 @@ import org.apache.spark.annotation.Evolving; import org.apache.spark.sql.sources.v2.reader.Scan; import org.apache.spark.sql.sources.v2.reader.ScanBuilder; +import org.apache.spark.sql.types.StructType; /** - * An empty mix-in interface for {@link Table}, to indicate this table supports batch scan. - * - * If a {@link Table} implements this interface, its {@link Table#newScanBuilder(DataSourceOptions)} - * must return a {@link ScanBuilder} that builds {@link Scan} with {@link Scan#toBatch()} - * implemented. - * + * A mix-in interface for {@link Table} to provide data reading ability of batch processing. */ @Evolving -public interface SupportsBatchRead extends Table { } +public interface SupportsBatchRead extends Table { + + /** + * Returns the schema of this table. + */ + StructType schema(); Review comment: or we treat it as an invalid data source, and ask users to use `df.foreach` instead? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
cloud-fan commented on a change in pull request #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#discussion_r240447132
## File path: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java ##
@@ -20,14 +20,27 @@
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.sources.v2.reader.Scan;
 import org.apache.spark.sql.sources.v2.reader.ScanBuilder;
+import org.apache.spark.sql.types.StructType;
 
 /**
- * An empty mix-in interface for {@link Table}, to indicate this table supports batch scan.
- *
- * If a {@link Table} implements this interface, its {@link Table#newScanBuilder(DataSourceOptions)}
- * must return a {@link ScanBuilder} that builds {@link Scan} with {@link Scan#toBatch()}
- * implemented.
- *
+ * A mix-in interface for {@link Table} to provide data reading ability of batch processing.
  */
 @Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends Table {
+
+  /**
+   * Returns the schema of this table.
+   */
+  StructType schema();

Review comment: @rdblue what about the use case in http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html ? I.e. the data source can write data of any schema, it is not readable, and it has no schema. Do you mean we should ask the data source to return an empty schema in this case? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
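To make the option under discussion concrete, here is a rough sketch of a write-only source that reports an empty schema because it accepts data of any schema and is not readable. The class and method names here follow the in-flux DataSourceV2 interfaces in this PR and are illustrative assumptions, not a released API.

```scala
import org.apache.spark.sql.types.StructType

// Illustrative write-only "table": it accepts any input schema, so there is
// nothing meaningful to report back from schema().
class AnySchemaSinkTable /* extends Table with the write-support mix-ins */ {
  def schema(): StructType = new StructType() // empty schema
}
```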
[GitHub] HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240446659
## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala ##
@@ -246,13 +246,56 @@ private[yarn] class ExecutorRunnable(
       sys.env.get("SPARK_USER").foreach { user =>
         val containerId = ConverterUtils.toString(c.getId)
         val address = c.getNodeHttpAddress
-        val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user"
-        env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096"
-        env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096"
+        sparkConf.get(config.CUSTOM_LOG_URL) match {
+          case Some(customUrl) =>
+            val pathVariables = ExecutorRunnable.buildPathVariables(httpScheme, address,
+              YarnConfiguration.getClusterId(conf), containerId, user)
+            val envNameToFileNameMap = Map("SPARK_LOG_URL_STDERR" -> "stderr",
+              "SPARK_LOG_URL_STDOUT" -> "stdout")
+            val logUrls = ExecutorRunnable.replaceLogUrls(customUrl, pathVariables,
+              envNameToFileNameMap)
+
+            logUrls.foreach { case (envName, url) =>
+              env(envName) = url
+            }
+          case None =>
+            val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user"
+            env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096"
+            env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096"
+        }
       }
     }
 
     env
   }
 }
+
+private[yarn] object ExecutorRunnable {
+  val LOG_URL_PATTERN_HTTP_SCHEME = "{{HttpScheme}}"

Review comment: Ah OK. I'm in favor of avoiding direct use of string constants, but I don't have a strong opinion on it. Will address. And yes, I can put them in a single method, but placing a new method in the class would put an unnecessary burden on the test code, since ExecutorRunnable takes a lot of parameters to be instantiated. If we want to add an end-to-end test (instantiating a YARN cluster and running executors) we still need to instantiate ExecutorRunnable (I think we are already covering it from here [1]), but if we just want to make sure the logic works properly, we might want to keep this as a new object and add a test against the object to avoid instantiating ExecutorRunnable. WDYT? 1. https://github.com/apache/spark/blob/05cf81e6de3d61ddb0af81cd179665693f23351f/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala#L442-L461 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
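A minimal sketch of the substitution logic being discussed, under the assumption that the proposed helper simply replaces the `{{...}}` variables in the configured URL pattern. Names mirror the constants shown in the diff above, but the implementation here is illustrative, not the PR's actual code.

```scala
object LogUrlPatternSketch {
  // Produce one resolved URL per (env var -> file name) pair by substituting
  // the path variables and the per-file {{FileName}} into the pattern.
  def replaceLogUrls(
      pattern: String,
      pathVariables: Map[String, String],
      envNameToFileName: Map[String, String]): Map[String, String] = {
    envNameToFileName.map { case (envName, fileName) =>
      val vars = pathVariables + ("{{FileName}}" -> fileName)
      envName -> vars.foldLeft(pattern) { case (url, (k, v)) => url.replace(k, v) }
    }
  }
}

// Example with made-up values:
// LogUrlPatternSketch.replaceLogUrls(
//   "{{HttpScheme}}{{NodeHttpAddress}}/node/containerlogs/{{ContainerId}}/{{User}}/{{FileName}}?start=-4096",
//   Map("{{HttpScheme}}" -> "http://", "{{NodeHttpAddress}}" -> "nm-host:8042",
//     "{{ContainerId}}" -> "container_e01_0001_01_000002", "{{User}}" -> "spark"),
//   Map("SPARK_LOG_URL_STDOUT" -> "stdout", "SPARK_LOG_URL_STDERR" -> "stderr"))
```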
[GitHub] HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262#issuecomment-446040157 It doesn't work right now... right? Haha. I tried to merge but failed. Somehow the Apache Spark repo looks to be down (?). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on issue #23271: [SPARK-26318][SQL] Enhance function merge performance in Row
cloud-fan commented on issue #23271: [SPARK-26318][SQL] Enhance function merge performance in Row URL: https://github.com/apache/spark/pull/23271#issuecomment-446040156 I have the same question. If it's not used by anyone, maybe we can deprecate it and remove it in the next release This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
cloud-fan commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262#issuecomment-446038519 thanks, merging to master! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446038018 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99934/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240444758 ## File path: docs/running-on-yarn.md ## @@ -430,6 +430,21 @@ To use a custom metrics.properties for the application master and executors, upd See spark.yarn.config.gatewayPath. + + spark.yarn.custom.log.url + (none) + + Specifies custom spark log url for supporting external log service rather than NodeManager webapp address. + Spark will support some path variables via patterns. Supported patterns and allocated values are below: + + * `{{HttpScheme}}`: `http`/`https` according to YARN HTTP policy. (Configured via `yarn.http.policy`) + * `{{NodeHttpAddress}}`: HTTP URI of the node on Container. Review comment: My bad. It is `host:port` rather than a URI, which can be retrieved from `container.getNodeHttpAddress`. The wording `node on container` is borrowed from the javadoc of that method, but I'm OK with using anything clearer. Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
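A short usage sketch of the configuration described above; the configuration key is the one proposed in this PR (it may change before merge), and the log-service host is a made-up example.

```scala
import org.apache.spark.SparkConf

// Point executor stdout/stderr links at an external log service instead of the
// NodeManager web UI, using the path variables listed in the doc above.
val conf = new SparkConf()
  .set("spark.yarn.custom.log.url",
    "{{HttpScheme}}logservice.example.com/{{ClusterId}}/{{ContainerId}}/{{User}}/{{FileName}}")
```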
[GitHub] AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446038016 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446038018 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99934/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
AmplabJenkins commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446038016 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
SparkQA removed a comment on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-445983502 **[Test build #99934 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99934/testReport)** for PR 23252 at commit [`9ccff66`](https://github.com/apache/spark/commit/9ccff66b14946f119a870416176b7614f28b37c1). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL.
SparkQA commented on issue #23252: [SPARK-26239] File-based secret key loading for SASL. URL: https://github.com/apache/spark/pull/23252#issuecomment-446037823 **[Test build #99934 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99934/testReport)** for PR 23252 at commit [`9ccff66`](https://github.com/apache/spark/commit/9ccff66b14946f119a870416176b7614f28b37c1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240444139 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala ## @@ -246,13 +246,56 @@ private[yarn] class ExecutorRunnable( sys.env.get("SPARK_USER").foreach { user => val containerId = ConverterUtils.toString(c.getId) val address = c.getNodeHttpAddress -val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" -env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" -env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" +sparkConf.get(config.CUSTOM_LOG_URL) match { + case Some(customUrl) => +val pathVariables = ExecutorRunnable.buildPathVariables(httpScheme, address, + YarnConfiguration.getClusterId(conf), containerId, user) +val envNameToFileNameMap = Map("SPARK_LOG_URL_STDERR" -> "stderr", + "SPARK_LOG_URL_STDOUT" -> "stdout") +val logUrls = ExecutorRunnable.replaceLogUrls(customUrl, pathVariables, + envNameToFileNameMap) + +logUrls.foreach { case (envName, url) => + env(envName) = url +} + case None => +val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" Review comment: Yes it will remove the branch. Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446037380 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5946/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
SparkQA commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446037548 **[Test build #99941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99941/testReport)** for PR 23249 at commit [`ddb82c3`](https://github.com/apache/spark/commit/ddb82c3c8822e647790b1a303f647f0bf6dc1c9d). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446037378 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446037380 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5946/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446037378 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
cloud-fan commented on a change in pull request #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#discussion_r240431621
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ##
@@ -118,10 +115,12 @@
 
 /**
  * Represents data where tuples have been ordered according to the `ordering`
- * [[Expression Expressions]]. This is a strictly stronger guarantee than
- * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the
- * same value for the ordering expressions are contiguous and will never be split across
- * partitions.
+ * [[Expression Expressions]]. Its requirement is defined as the following:
+ *   - Given any 2 adjacent partitions, all the rows of the second partition must be larger than or
+ *     equal to any row in the first partition, according to the `ordering` expressions.

Review comment: @hvanhovell We need this relaxed requirement, otherwise we have to remove the optimization [here](https://github.com/apache/spark/pull/23249/files#diff-14e5db1a50efdaf9c760aeb5ced4R272) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
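A tiny illustration of the relaxed requirement stated in the new comment: adjacent sorted partitions may share a boundary value, which the old "never split across partitions" wording forbade, while still being ordered relative to each other.

```scala
// Two adjacent partitions sorted ascending; the boundary value 3 appears in both.
val partition0 = Seq(1, 2, 3)
val partition1 = Seq(3, 4, 5)

// Every row in the second partition is >= every row in the first partition,
// so the relaxed OrderedDistribution requirement holds even though rows with
// the same ordering value are split across partitions.
assert(partition1.forall(r => partition0.forall(_ <= r)))
```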
[GitHub] cloud-fan commented on issue #23232: [SPARK-26233][SQL][BACKPORT-2.4] CheckOverflow when encoding a decimal value
cloud-fan commented on issue #23232: [SPARK-26233][SQL][BACKPORT-2.4] CheckOverflow when encoding a decimal value URL: https://github.com/apache/spark/pull/23232#issuecomment-446036494 Hi @mgaido91, just one more question. Without this patch, does Spark always return a wrong result if the actual decimal doesn't fit the precision? Even for simple operations like `df.show`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HyukjinKwon removed a comment on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
HyukjinKwon removed a comment on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262#issuecomment-446030243 Merged to master. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance
HyukjinKwon commented on issue #23262: [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance URL: https://github.com/apache/spark/pull/23262#issuecomment-446030243 Merged to master. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
HeartSaVioR commented on a change in pull request #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr URL: https://github.com/apache/spark/pull/23260#discussion_r240436930 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala ## @@ -246,13 +246,56 @@ private[yarn] class ExecutorRunnable( sys.env.get("SPARK_USER").foreach { user => val containerId = ConverterUtils.toString(c.getId) val address = c.getNodeHttpAddress -val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" -env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" -env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" +sparkConf.get(config.CUSTOM_LOG_URL) match { + case Some(customUrl) => +val pathVariables = ExecutorRunnable.buildPathVariables(httpScheme, address, + YarnConfiguration.getClusterId(conf), containerId, user) +val envNameToFileNameMap = Map("SPARK_LOG_URL_STDERR" -> "stderr", + "SPARK_LOG_URL_STDOUT" -> "stdout") +val logUrls = ExecutorRunnable.replaceLogUrls(customUrl, pathVariables, + envNameToFileNameMap) + +logUrls.foreach { case (envName, url) => + env(envName) = url +} + case None => +val baseUrl = s"$httpScheme$address/node/containerlogs/$containerId/$user" +env("SPARK_LOG_URL_STDERR") = s"$baseUrl/stderr?start=-4096" +env("SPARK_LOG_URL_STDOUT") = s"$baseUrl/stdout?start=-4096" + } } } env } } + +private[yarn] object ExecutorRunnable { + val LOG_URL_PATTERN_HTTP_SCHEME = "{{HttpScheme}}" + val LOG_URL_PATTERN_NODE_HTTP_ADDRESS = "{{NodeHttpAddress}}" + val LOG_URL_PATTERN_CLUSTER_ID = "{{ClusterId}}" + val LOG_URL_PATTERN_CONTAINER_ID = "{{ContainerId}}" + val LOG_URL_PATTERN_USER = "{{User}}" + val LOG_URL_PATTERN_FILE_NAME = "{{FileName}}" + + def buildPathVariables(httpScheme: String, nodeHttpAddress: String, clusterId: String, Review comment: I guess it's allowed when it fits within two-lines, but no problem to change it. Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] devaraj-kavali commented on issue #19616: [SPARK-22404][YARN] Provide an option to use unmanaged AM in yarn-client mode
devaraj-kavali commented on issue #19616: [SPARK-22404][YARN] Provide an option to use unmanaged AM in yarn-client mode URL: https://github.com/apache/spark/pull/19616#issuecomment-446029794 Thanks @vanzin for taking time to look into this, will update it with the changes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
AmplabJenkins commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446028069 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5945/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
AmplabJenkins removed a comment on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446028068 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
SparkQA commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446028080 **[Test build #99940 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99940/testReport)** for PR 23277 at commit [`0e00aa7`](https://github.com/apache/spark/commit/0e00aa7a219805f3d14ca4d222df4a922a34d825). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
AmplabJenkins removed a comment on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446028069 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5945/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
AmplabJenkins commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446028068 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen closed pull request #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch
srowen closed pull request #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch URL: https://github.com/apache/spark/pull/23168 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index 1186fb73d0faf..d345512d2b8e8 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -265,3 +265,38 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href=http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and Cohen</a>.
+From the abstract: PIC finds a very low-dimensional embedding of a dataset
+using truncated power iteration on a normalized pair-wise similarity matrix of the data.
+
+`spark.ml`'s PowerIterationClustering implementation takes the following parameters:
+
+* `k`: the number of clusters to create
+* `initMode`: param for the initialization algorithm
+* `maxIter`: param for maximum number of iterations
+* `srcCol`: param for the name of the input column for source vertex IDs
+* `dstCol`: name of the input column for destination vertex IDs
+* `weightCol`: Param for weight column name
+
+**Examples**
+
+
+
+Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering) for more details.
+
+{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
+
+
+Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.
+
+{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
+
+
+

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
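As a companion to the new doc section, here is a short Scala usage sketch; the toy edge list mirrors the example data used in the related R PR (#23072) later in this digest, and the parameter values are arbitrary.

```scala
import org.apache.spark.ml.clustering.PowerIterationClustering
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pic-example").getOrCreate()

// A small weighted edge list: (source vertex id, destination vertex id, weight).
val edges = spark.createDataFrame(Seq(
  (0L, 1L, 1.0), (0L, 2L, 1.0), (1L, 2L, 1.0), (3L, 4L, 1.0), (4L, 0L, 0.1)
)).toDF("src", "dst", "weight")

val pic = new PowerIterationClustering()
  .setK(2)
  .setMaxIter(20)
  .setInitMode("degree")
  .setWeightCol("weight")

// assignClusters returns a DataFrame of (id, cluster) assignments.
pic.assignClusters(edges).show()
```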
[GitHub] srowen commented on issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch
srowen commented on issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch URL: https://github.com/apache/spark/pull/23168#issuecomment-446027874 Merged to 2.4 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] icexelloss commented on a change in pull request #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window)
icexelloss commented on a change in pull request #22305: [SPARK-24561][SQL][Python] User-defined window aggregation functions with Pandas UDF (bounded window) URL: https://github.com/apache/spark/pull/22305#discussion_r240434370 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala ## @@ -144,24 +282,107 @@ case class WindowInPandasExec( queue.close() } - val inputProj = UnsafeProjection.create(allInputs, child.output) - val pythonInput = grouped.map { case (_, rows) => -rows.map { row => - queue.add(row.asInstanceOf[UnsafeRow]) - inputProj(row) + val stream = iter.map { row => +queue.add(row.asInstanceOf[UnsafeRow]) +row + } + + val pythonInput = new Iterator[Iterator[UnsafeRow]] { + +// Manage the stream and the grouping. +var nextRow: UnsafeRow = null +var nextGroup: UnsafeRow = null +var nextRowAvailable: Boolean = false +private[this] def fetchNextRow() { + nextRowAvailable = stream.hasNext + if (nextRowAvailable) { +nextRow = stream.next().asInstanceOf[UnsafeRow] +nextGroup = grouping(nextRow) + } else { +nextRow = null +nextGroup = null + } +} +fetchNextRow() + +// Manage the current partition. +val buffer: ExternalAppendOnlyUnsafeRowArray = + new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold) +var bufferIterator: Iterator[UnsafeRow] = _ + +val indexRow = new SpecificInternalRow(Array.fill(numBoundIndices)(IntegerType)) + +val frames = factories.map(_(indexRow)) + +private[this] def fetchNextPartition() { + // Collect all the rows in the current partition. + // Before we start to fetch new input rows, make a copy of nextGroup. + val currentGroup = nextGroup.copy() + + // clear last partition + buffer.clear() + + while (nextRowAvailable && nextGroup == currentGroup) { Review comment: @hvanhovell Thanks for chiming in. I am not very familiar with unsafe row comparison and was just following @ueshin suggestion. If this is not needed. I can close #23279 and leave it as is. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen closed pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
srowen closed pull request #23072: [SPARK-19827][R]spark.ml R API for PIC URL: https://github.com/apache/spark/pull/23072 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index de56061b4c1c7..afb3b542b6c6d 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -67,7 +67,8 @@ exportMethods("glm", "spark.fpGrowth", "spark.freqItemsets", "spark.associationRules", - "spark.findFrequentSequentialPatterns") + "spark.findFrequentSequentialPatterns", + "spark.assignClusters") # Job group lifecycle management methods export("setJobGroup", diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index cbed276274ac1..c2a6acef8ea48 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1468,6 +1468,10 @@ setGeneric("spark.associationRules", function(object) { standardGeneric("spark.a setGeneric("spark.findFrequentSequentialPatterns", function(data, ...) { standardGeneric("spark.findFrequentSequentialPatterns") }) +#' @rdname spark.powerIterationClustering +setGeneric("spark.assignClusters", +function(data, ...) { standardGeneric("spark.assignClusters") }) + #' @param object a fitted ML model object. #' @param path the directory where the model is saved. #' @param ... additional argument(s) passed to the method. diff --git a/R/pkg/R/mllib_clustering.R b/R/pkg/R/mllib_clustering.R index 900be685824da..7d9dcebfe70d3 100644 --- a/R/pkg/R/mllib_clustering.R +++ b/R/pkg/R/mllib_clustering.R @@ -41,6 +41,12 @@ setClass("KMeansModel", representation(jobj = "jobj")) #' @note LDAModel since 2.1.0 setClass("LDAModel", representation(jobj = "jobj")) +#' S4 class that represents a PowerIterationClustering +#' +#' @param jobj a Java object reference to the backing Scala PowerIterationClustering +#' @note PowerIterationClustering since 3.0.0 +setClass("PowerIterationClustering", slots = list(jobj = "jobj")) + #' Bisecting K-Means Clustering Model #' #' Fits a bisecting k-means clustering model against a SparkDataFrame. @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PowerIterationClustering +#' +#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to +#' return a cluster assignment for each input vertex. +#' +# Run the PIC algorithm and returns a cluster assignment for each input vertex. +#' @param data a SparkDataFrame. +#' @param k the number of clusters to create. +#' @param initMode the initialization algorithm. +#' @param maxIter the maximum number of iterations. +#' @param sourceCol the name of the input column for source vertex IDs. +#' @param destinationCol the name of the input column for destination vertex IDs +#' @param weightCol weight column name. If this is not set or \code{NULL}, +#' we treat all instance weights as 1.0. +#' @param ... additional argument(s) passed to the method. +#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id. 
+#' The schema of it will be: +#' \code{id: Long} +#' \code{cluster: Int} +#' @rdname spark.powerIterationClustering +#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), +#'list(1L, 2L, 1.0), list(3L, 4L, 1.0), +#'list(4L, 0L, 0.1)), +#' schema = c("src", "dst", "weight")) +#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight") +#' showDF(clusters) +#' } +#' @note spark.assignClusters(SparkDataFrame) since 3.0.0 +setMethod("spark.assignClusters", + signature(data = "SparkDataFrame"), + function(data, k = 2L, initMode = c("random", "degree"), maxIter = 20L, +sourceCol = "src", destinationCol = "dst", weightCol = NULL) { +if (!is.numeric(k) || k < 1) { + stop("k should be a number with value >= 1.") +} +if (!is.integer(maxIter) || maxIter <= 0) { + stop("maxIter should be a number with value > 0.") +} +initMode <- match.arg(initMode) +if (!is.null(weightCol) && weightCol == "") { + weightCol <- NULL +} else if (!is.null(weightCol)) { + weightCol <- as.character(weightCol) +} +jobj <- callJStatic("org.apache.spark.ml.r.PowerIterationClusteringWrapper", +"get
[GitHub] srowen closed pull request #22683: [SPARK-25696] The storage memory displayed on spark Application UI is…
srowen closed pull request #22683: [SPARK-25696] The storage memory displayed on spark Application UI is… URL: https://github.com/apache/spark/pull/22683 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index e99136723f65b..0207f249f9aa0 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -87,7 +87,7 @@ objectFile <- function(sc, path, minPartitions = NULL) { #' in the list are split into \code{numSlices} slices and distributed to nodes #' in the cluster. #' -#' If size of serialized slices is larger than spark.r.maxAllocationLimit or (200MB), the function +#' If size of serialized slices is larger than spark.r.maxAllocationLimit or (200MiB), the function #' will write it to disk and send the file name to JVM. Also to make sure each slice is not #' larger than that limit, number of slices may be increased. #' diff --git a/R/pkg/R/mllib_tree.R b/R/pkg/R/mllib_tree.R index 0e60842dd44c8..9844061cfd074 100644 --- a/R/pkg/R/mllib_tree.R +++ b/R/pkg/R/mllib_tree.R @@ -157,7 +157,7 @@ print.summary.decisionTree <- function(x) { #' @param checkpointInterval Param for set checkpoint interval (>= 1) or disable checkpoint (-1). #' Note: this setting will be ignored if the checkpoint directory is not #' set. -#' @param maxMemoryInMB Maximum memory in MB allocated to histogram aggregation. +#' @param maxMemoryInMB Maximum memory in MiB allocated to histogram aggregation. #' @param cacheNodeIds If FALSE, the algorithm will pass trees to executors to match instances with #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching #' can speed up training of deeper trees. Users can set how often should the @@ -382,7 +382,7 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara #' @param checkpointInterval Param for set checkpoint interval (>= 1) or disable checkpoint (-1). #' Note: this setting will be ignored if the checkpoint directory is not #' set. -#' @param maxMemoryInMB Maximum memory in MB allocated to histogram aggregation. +#' @param maxMemoryInMB Maximum memory in MiB allocated to histogram aggregation. #' @param cacheNodeIds If FALSE, the algorithm will pass trees to executors to match instances with #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching #' can speed up training of deeper trees. Users can set how often should the @@ -588,7 +588,7 @@ setMethod("write.ml", signature(object = "RandomForestClassificationModel", path #' @param checkpointInterval Param for set checkpoint interval (>= 1) or disable checkpoint (-1). #' Note: this setting will be ignored if the checkpoint directory is not #' set. -#' @param maxMemoryInMB Maximum memory in MB allocated to histogram aggregation. +#' @param maxMemoryInMB Maximum memory in MiB allocated to histogram aggregation. #' @param cacheNodeIds If FALSE, the algorithm will pass trees to executors to match instances with #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching #' can speed up training of deeper trees. 
Users can set how often should the diff --git a/core/src/main/resources/org/apache/spark/ui/static/utils.js b/core/src/main/resources/org/apache/spark/ui/static/utils.js index deeafad4eb5f5..22985e31a7808 100644 --- a/core/src/main/resources/org/apache/spark/ui/static/utils.js +++ b/core/src/main/resources/org/apache/spark/ui/static/utils.js @@ -40,9 +40,9 @@ function formatDuration(milliseconds) { function formatBytes(bytes, type) { if (type !== 'display') return bytes; if (bytes == 0) return '0.0 B'; -var k = 1000; +var k = 1024; var dm = 1; -var sizes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']; +var sizes = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']; var i = Math.floor(Math.log(bytes) / Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i]; } diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index 845a3d5f6d6f9..696dafda6d1ec 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -1043,7 +1043,7 @@ class SparkContext(config: SparkConf) extends Logging { // See SPARK-11227 for details. FileSystem.getLocal(hadoopConfiguration) -// A Hadoop configuration can be about 10 KB, which
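A Scala analogue of the `formatBytes` change shown in the diff above, illustrating the switch to a 1024 base with binary-prefixed unit labels; this is a sketch for clarity, not the UI's actual JavaScript.

```scala
def formatBytes(bytes: Double): String = {
  if (bytes == 0) return "0.0 B"
  val k = 1024.0 // previously 1000, which mislabeled binary sizes as KB/MB/GB
  val sizes = Array("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB")
  val i = math.min((math.log(bytes) / math.log(k)).toInt, sizes.length - 1)
  f"${bytes / math.pow(k, i)}%.1f ${sizes(i)}"
}

// 500 * 1024 * 1024 bytes now renders as "500.0 MiB" rather than "524.3 MB".
```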
[GitHub] cloud-fan commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
cloud-fan commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446027024 good catch! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly
cloud-fan commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446027053 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on issue #23072: [SPARK-19827][R]spark.ml R API for PIC
srowen commented on issue #23072: [SPARK-19827][R]spark.ml R API for PIC URL: https://github.com/apache/spark/pull/23072#issuecomment-446026964 Merged to master This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] srowen commented on issue #22683: [SPARK-25696] The storage memory displayed on spark Application UI is…
srowen commented on issue #22683: [SPARK-25696] The storage memory displayed on spark Application UI is… URL: https://github.com/apache/spark/pull/22683#issuecomment-446026600 Merged to master This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
cloud-fan commented on a change in pull request #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#discussion_r240431799 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ## @@ -118,10 +115,12 @@ case class HashClusteredDistribution( /** * Represents data where tuples have been ordered according to the `ordering` - * [[Expression Expressions]]. This is a strictly stronger guarantee than - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the - * same value for the ordering expressions are contiguous and will never be split across - * partitions. + * [[Expression Expressions]]. Its requirement is defined as the following: + * - Given any 2 adjacent partitions, all the rows of the second partition must be larger than or + * equal to any row in the first partition, according to the `ordering` expressions. Review comment: I did not change the semantics; I just corrected the comment to reflect what the current semantics are. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on a change in pull request #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
cloud-fan commented on a change in pull request #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#discussion_r240431621 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ## @@ -118,10 +115,12 @@ case class HashClusteredDistribution( /** * Represents data where tuples have been ordered according to the `ordering` - * [[Expression Expressions]]. This is a strictly stronger guarantee than - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the - * same value for the ordering expressions are contiguous and will never be split across - * partitions. + * [[Expression Expressions]]. Its requirement is defined as the following: + * - Given any 2 adjacent partitions, all the rows of the second partition must be larger than or + * equal to any row in the first partition, according to the `ordering` expressions. Review comment: @hvanhovell We need this relaxed requirement, otherwise we have to remove the optimization [here](https://github.com/apache/spark/pull/23249/files#diff-14e5db1a50efdaf9c760aeb5ced4R273) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
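To make the requirement quoted in the hunk above concrete, here is a small, self-contained Scala sketch (an illustration only, not Spark code): for every pair of adjacent partitions, each row of the second partition must compare greater than or equal to every row of the first under the ordering.

```scala
// Illustration only: a property check for the documented requirement, with an
// Ordering[T] standing in for the `ordering` expressions.
def satisfiesAdjacentPartitionOrdering[T](partitions: Seq[Seq[T]])(
    implicit ord: Ordering[T]): Boolean =
  partitions.sliding(2).forall {
    case Seq(first, second) if first.nonEmpty && second.nonEmpty =>
      // all rows of the second partition are >= any row in the first partition
      ord.gteq(second.min, first.max)
    case _ => true // fewer than two non-empty partitions: nothing to violate
  }

// Example: Seq(Seq(1, 3, 2), Seq(3, 5, 4)) satisfies the check even though the
// boundary value 3 appears in both partitions, which the relaxed wording permits.
```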
[GitHub] AmplabJenkins removed a comment on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.
AmplabJenkins removed a comment on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal. URL: https://github.com/apache/spark/pull/22911#issuecomment-446022247 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.
AmplabJenkins removed a comment on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal. URL: https://github.com/apache/spark/pull/22911#issuecomment-446022248 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99932/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.
AmplabJenkins commented on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal. URL: https://github.com/apache/spark/pull/22911#issuecomment-446022247 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.
AmplabJenkins commented on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal. URL: https://github.com/apache/spark/pull/22911#issuecomment-446022248 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99932/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] vanzin commented on issue #23220: [SPARK-25877][k8s] Move all feature logic to feature classes.
vanzin commented on issue #23220: [SPARK-25877][k8s] Move all feature logic to feature classes. URL: https://github.com/apache/spark/pull/23220#issuecomment-446022089 So, anybody interested in reviewing this? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.
SparkQA removed a comment on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal. URL: https://github.com/apache/spark/pull/22911#issuecomment-445939970 **[Test build #99932 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99932/testReport)** for PR 22911 at commit [`4143603`](https://github.com/apache/spark/commit/414360315a484d7a32e6817ff2197cd1f3f8f43c). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
AmplabJenkins removed a comment on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics URL: https://github.com/apache/spark/pull/22279#issuecomment-446021735 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5944/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
AmplabJenkins removed a comment on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics URL: https://github.com/apache/spark/pull/22279#issuecomment-446021727 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal.
SparkQA commented on issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keytab-based token renewal. URL: https://github.com/apache/spark/pull/22911#issuecomment-446021887 **[Test build #99932 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99932/testReport)** for PR 22911 at commit [`4143603`](https://github.com/apache/spark/commit/414360315a484d7a32e6817ff2197cd1f3f8f43c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class ArrowCollectSerializer(Serializer):`
   * `class CSVInferSchema(val options: CSVOptions) extends Serializable `
   * `class InterpretedSafeProjection(expressions: Seq[Expression]) extends Projection `
   * `sealed trait DateTimeFormatter `
   * `class Iso8601DateTimeFormatter(`
   * `class LegacyDateTimeFormatter(`
   * `class LegacyFallbackDateTimeFormatter(`
   * `sealed trait DateFormatter `
   * `class Iso8601DateFormatter(`
   * `class LegacyDateFormatter(`
   * `class LegacyFallbackDateFormatter(`

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
AmplabJenkins commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics URL: https://github.com/apache/spark/pull/22279#issuecomment-446021727 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
AmplabJenkins commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics URL: https://github.com/apache/spark/pull/22279#issuecomment-446021735 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5944/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
SparkQA commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics URL: https://github.com/apache/spark/pull/22279#issuecomment-446021730 **[Test build #99939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99939/testReport)** for PR 22279 at commit [`a990758`](https://github.com/apache/spark/commit/a99075873fa1519fe07344eb90d5c8af02b9d7e5). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] vanzin closed pull request #21279: [SPARK-24219][k8s] Improve the docker building script to avoid copying everything under examples to docker image
vanzin closed pull request #21279: [SPARK-24219][k8s] Improve the docker building script to avoid copying everything under examples to docker image URL: https://github.com/apache/spark/pull/21279 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh index f090240065bf1..7ded77426340b 100755 --- a/bin/docker-image-tool.sh +++ b/bin/docker-image-tool.sh @@ -44,15 +44,37 @@ function image_ref { function build { local BUILD_ARGS local IMG_PATH + local TMPFOLDER if [ ! -f "$SPARK_HOME/RELEASE" ]; then # Set image build arguments accordingly if this is a source repo and not a distribution archive. +local JARS="${SPARK_HOME}/assembly/target/scala-${SPARK_SCALA_VERSION}/jars" +TMPFOLDER=`mktemp -q -d examples.XX` +if [ $? -ne 0 ]; then + ehco "Cannot create temp folder, exiting..." + exit 1 +fi + +mkdir -p "${TMPFOLDER}/jars" +cp "${SPARK_HOME}"/examples/target/scala*/jars/* "${TMPFOLDER}/jars" +for f in "${TMPFOLDER}"/jars/*; do + name=$(basename "$f") + if [ -f "${JARS}/${name}" ]; then +rm "${TMPFOLDER}/jars/${name}" + fi +done + +mkdir -p "${TMPFOLDER}/src/main" +cp -r "${SPARK_HOME}/examples/src/main" "${TMPFOLDER}/src" + IMG_PATH=resource-managers/kubernetes/docker/src/main/dockerfiles BUILD_ARGS=( --build-arg img_path=$IMG_PATH --build-arg spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars + --build-arg + spark_examples=$TMPFOLDER ) else # Not passed as an argument to docker, but used to validate the Spark directory. @@ -69,6 +91,10 @@ function build { docker build "${BUILD_ARGS[@]}" \ -t $(image_ref spark) \ -f "$DOCKERFILE" . + + if [ -d "${TMPFOLDER}" ]; then +rm -fr "${TMPFOLDER}" + fi } function push { diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile index 9badf8556afc3..198f14f2955d1 100644 --- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile +++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile @@ -19,6 +19,7 @@ FROM openjdk:8-alpine ARG spark_jars=jars ARG img_path=kubernetes/dockerfiles +ARG spark_examples=examples # Before building the docker image, first build and make a Spark distribution following # the instructions in http://spark.apache.org/docs/latest/building-spark.html. @@ -41,7 +42,7 @@ COPY ${spark_jars} /opt/spark/jars COPY bin /opt/spark/bin COPY sbin /opt/spark/sbin COPY ${img_path}/spark/entrypoint.sh /opt/ -COPY examples /opt/spark/examples +COPY ${spark_examples} /opt/spark/examples COPY data /opt/spark/data ENV SPARK_HOME /opt/spark This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] vanzin commented on issue #21279: [SPARK-24219][k8s] Improve the docker building script to avoid copying everything under examples to docker image
vanzin commented on issue #21279: [SPARK-24219][k8s] Improve the docker building script to avoid copying everything under examples to docker image URL: https://github.com/apache/spark/pull/21279#issuecomment-446021270 This was fixed in SPARK-26025. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
AmplabJenkins removed a comment on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics URL: https://github.com/apache/spark/pull/22279#issuecomment-417219160 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] vanzin commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
vanzin commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics URL: https://github.com/apache/spark/pull/22279#issuecomment-446020969 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] aokolnychyi commented on issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - foundation
aokolnychyi commented on issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - foundation URL: https://github.com/apache/spark/pull/21320#issuecomment-446020655 @mallman @dbtsai @gatorsmile One question on non-deterministic expressions. For example, let's consider a non-deterministic UDF.
```
val nonDeterministicUdf = udf((first: String) => first + " " + Math.random()).asNondeterministic()
val query = data.select(col("id"), nonDeterministicUdf(col("name.first")))
```
As it is today, there will be no schema pruning, due to the way `collectProjectsAndFilters` is defined in `PhysicalOperation`.
```
== Analyzed Logical Plan ==
id: int, UDF(name.first): string
Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
+- Project [id#222, name#223, address#224, pets#225, friends#226, relatives#227, employer#228, p#229]
+- SubqueryAlias `contacts`
+- Relation[id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] parquet

== Optimized Logical Plan ==
Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
+- Relation[id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] parquet

== Physical Plan ==
*(1) Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
+- *(1) FileScan parquet [id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/f3/6jyczfzd15ndvh49zq0d_sg8gn/T/spark-6b69e4e9-c6..., PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct,address:string,pets:int,friends...
```
To me, it seems valid to apply schema pruning in this case. What do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
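A hypothetical, simplified Scala sketch of the gate being described (names and structure are illustrative, not the actual `PhysicalOperation` source): projection collection is typically guarded on every projected expression being deterministic, which is why the non-deterministic UDF above currently blocks pruning.

```scala
// Hypothetical model, not Spark source: only the determinism gate is illustrated.
final case class ProjectedExpr(name: String, deterministic: Boolean)

def collectibleForPruning(projectList: Seq[ProjectedExpr]): Boolean =
  projectList.forall(_.deterministic)

// The query above projects `id` plus a non-deterministic UDF over `name.first`,
// so the project list fails the gate and the scan keeps its full schema today.
val projections = Seq(
  ProjectedExpr("id", deterministic = true),
  ProjectedExpr("UDF(name.first)", deterministic = false))
assert(!collectibleForPruning(projections))
```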
[GitHub] vanzin commented on issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on executor side
vanzin commented on issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on executor side URL: https://github.com/apache/spark/pull/19840#issuecomment-446020608 @yaooqinn we should probably close this if you're not planning to look at the root of the problem, which seems to be on the python side. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] vanzin commented on a change in pull request #19616: [SPARK-22404][YARN] Provide an option to use unmanaged AM in yarn-client mode
vanzin commented on a change in pull request #19616: [SPARK-22404][YARN] Provide an option to use unmanaged AM in yarn-client mode URL: https://github.com/apache/spark/pull/19616#discussion_r240421106 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ## @@ -1084,14 +1095,38 @@ private[spark] class Client( if (returnOnRunning && state == YarnApplicationState.RUNNING) { return createAppReport(report) } - + if (state == YarnApplicationState.ACCEPTED && isClientUnmanagedAMEnabled +&& !amServiceStarted && report.getAMRMToken != null) { +amServiceStarted = true +startApplicationMasterService(report) + } lastState = state } // Never reached, but keeps compiler happy throw new SparkException("While loop is depleted! This should never happen...") } + private def startApplicationMasterService(report: ApplicationReport) = { +// Add AMRMToken to establish connection between RM and AM +val token = report.getAMRMToken +val amRMToken: org.apache.hadoop.security.token.Token[AMRMTokenIdentifier] = + new org.apache.hadoop.security.token.Token[AMRMTokenIdentifier](token +.getIdentifier().array(), token.getPassword().array, new Text( +token.getKind()), new Text(token.getService())) +val currentUGI = UserGroupInformation.getCurrentUser +currentUGI.addToken(amRMToken) + +sparkConf.set("spark.yarn.containerId", + ContainerId.newContainerId(report.getCurrentApplicationAttemptId, 1).toString) +// Start Application Service in a separate thread and continue with application monitoring +val amService = new Thread() { Review comment: Thread name? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
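For context on the one-line review comment above, a minimal Scala sketch of what naming the thread could look like; the thread name and daemon flag here are assumptions, not part of the PR.

```scala
// Sketch only: an explicitly named thread is easier to identify in jstack output
// and in driver thread dumps than an anonymous "Thread-N".
val amService = new Thread("yarn-unmanaged-am-service") { // name is an assumption
  override def run(): Unit = {
    // start the ApplicationMaster service here (placeholder)
  }
}
amService.setDaemon(true)
amService.start()
```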