[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 Thanks. I rebased it.
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21533

**[Test build #4219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4219/testReport)** for PR 21533 at commit [`eb46ccf`](https://github.com/apache/spark/commit/eb46ccfec084c2439a26eee38015381f091fe164).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.
[GitHub] spark issue #21807: [SPARK-24536] Validate that limit clause cannot have a n...
Github user NiharS commented on the issue: https://github.com/apache/spark/pull/21807

New to SQL, but it seems like the query `SELECT 1 LIMIT CAST('1' AS INT)` should work, right? I tried it both on Spark without your change and in the W3Schools SQL tester, and it's accepted in both, but on your PR it hits the new AnalysisException. If this is indeed an issue, it can be avoided by having the case be `e.eval() == null` instead of `e.nullable`, although there's some duplicated work since the previous case also calls `e.eval()`.
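For illustration, a minimal sketch of the suggested check, assuming a hypothetical `checkLimitClause` helper and a local stand-in for `CheckAnalysis.failAnalysis` (this is not the exact `CheckAnalysis` code):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative stand-in for CheckAnalysis.failAnalysis
def failAnalysis(msg: String): Nothing = throw new IllegalStateException(msg)

// Evaluate the foldable limit expression once and reject only a null
// result, so a foldable cast like CAST('1' AS INT) still passes.
def checkLimitClause(limitExpr: Expression): Unit = {
  if (!limitExpr.foldable) {
    failAnalysis("The limit expression must evaluate to a constant value")
  }
  limitExpr.eval() match { // single eval() call, reused for all checks
    case null =>
      failAnalysis("The evaluated limit expression must not be null")
    case v: Int if v < 0 =>
      failAnalysis(s"The limit expression must be equal to or greater than 0, but got $v")
    case _ => // valid constant limit
  }
}
```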
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1104/ Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test PASSed.
[GitHub] spark issue #21447: [SPARK-24339][SQL]Add project for transform/map/reduce s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21447 @maropu Do you want to take this over and add such a project in `ColumnPruning`?
[GitHub] spark issue #21656: [SPARK-24677][Core]Avoid NoSuchElementException from Med...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21656 merged thanks @cxzl25
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #93246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93246/testReport)** for PR 20146 at commit [`c003bd3`](https://github.com/apache/spark/commit/c003bd3d6c58cf19249ff0ba9dd10140971d655c).
[GitHub] spark pull request #21635: [SPARK-24594][YARN] Introducing metrics for YARN
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/21635#discussion_r203495430

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterSource.scala ---
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn
+
+import com.codahale.metrics.{Gauge, MetricRegistry}
+
+import org.apache.spark.metrics.source.Source
+
+private[spark] class ApplicationMasterSource(yarnAllocator: YarnAllocator) extends Source {
+
+  override val sourceName: String = "applicationMaster"
--- End diff --

The config spark.metrics.namespace already exists; see the metrics section in http://spark.apache.org/docs/latest/monitoring.html. But if you look at the code https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L129, it's only applied for executor and driver metrics. I think we should have it apply to the YARN metrics as well.
[GitHub] spark issue #21468: [SPARK-22151] : PYTHONPATH not picked up from the spark....
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21468 merged thanks @pgandhi999
[GitHub] spark pull request #21742: [SPARK-24768][SQL] Have a built-in AVRO data sour...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21742#discussion_r203496489

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+package object avro {
+  /**
+   * Adds a method, `avro`, to DataFrameWriter that allows you to write avro files using
+   * the DataFileWriter
+   */
+  implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) {
+    def avro: String => Unit = writer.format("avro").save
--- End diff --

I'd remove the Scala API.
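For reference, without the implicit, a caller would use the generic writer API directly. A minimal sketch, assuming the Avro module is on the classpath so the short format name `"avro"` resolves (app name and output path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-demo").getOrCreate()
val df = spark.range(0, 10).toDF("id")

// Equivalent call without the AvroDataFrameWriter implicit:
df.write.format("avro").save("/tmp/avro-demo-output")
```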
[GitHub] spark pull request #21468: [SPARK-22151] : PYTHONPATH not picked up from the...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21468
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 Yeah, looks like the AppVeyor tests are triggered.
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user edwinalu commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203503691

--- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
     execId: String,
-    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+    executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+    execId: String,
+    stageId: Int,
+    stageAttemptId: Int,
+    executorMetrics: Array[Long])
--- End diff --

This sounds like a good solution, with both a clean API and an efficient implementation that will be easier to add new metrics to. Thanks!
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21710

**[Test build #93245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93245/testReport)** for PR 21710 at commit [`ec88d38`](https://github.com/apache/spark/commit/ec88d38f3bafb9740c7c9a3856ddc3a64cb16fe7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21795

**[Test build #93239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93239/testReport)** for PR 21795 at commit [`c83eeeb`](https://github.com/apache/spark/commit/c83eeebc32d449412cc53951ad5b9611ac5f3268).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21710 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93245/ Test PASSed.
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21710 Merged build finished. Test PASSed.
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21795 Merged build finished. Test PASSed.
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21795 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93239/ Test PASSed.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

> Note that we only invoke any of the feature steps and the entry point of KubernetesClientApplication if we run in cluster mode. If we run in client mode, we enter directly into the user's main class, or, the user is in a process that just created a Spark context from scratch with the right master URL (i.e. new SparkContext(k8s://my-master:8443)). If you wanted to create the headless service in client mode, you'd have to do it when instantiating the KubernetesClusterSchedulerBackend somewhere, probably before creating the ExecutorPodsAllocator so that you'd set spark.driver.host and spark.driver.port properly when telling the created executors where to find the driver via the --driver-url argument.

For Out-Cluster we don't need any headless service, but for In-Cluster the user has to ensure that the driver is accessible via a service (headless or not). I understand it may be preferable to align with Spark's other managers and to have coherent behaviour for k8s in/out cluster, but I still believe that we have to make life easy for users and devops. The notebook case is self-explanatory: for Zeppelin, you may have a single pod (web server) that will create multiple client sessions (this is a common, supported scenario), and Jupyter with Toree may be configured to support multiple REPLs (and if not today, it will come one day). You also have the case of the current Livy, which is a single REST server managing a pool of REPLs. With the current PR, we push onto all those third-party consumers the responsibility to create services, plus all the hard work of assigning and maintaining ports to avoid collisions on a single pod with multiple drivers.

What about having a property that would enable/disable that headless service creation (it could be disabled by default)? This is not yet the perfect solution, as the user would still have to pass the ports and ensure there is no clash.
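For context, a minimal sketch of what in-cluster client mode looks like from the user side under the behavior being discussed: a pre-created headless service and hand-picked ports (the service name, namespace, master URL, and ports below are all hypothetical placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumes "spark-driver-service" is a headless service the user (or the
// notebook operator) created beforehand; the ports must match that service.
val conf = new SparkConf()
  .setAppName("notebook-repl")
  .setMaster("k8s://https://my-master:8443")
  .set("spark.driver.host", "spark-driver-service.my-namespace.svc")
  .set("spark.driver.port", "7077")
  .set("spark.driver.blockManager.port", "7078")
val sc = new SparkContext(conf)
```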
[GitHub] spark issue #21809: [SPARK-24851] : Map a Stage ID to it's Associated Job ID...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21809 ok to test
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589

@MaxGekk The example you cite is literally one of a handful of usages which is not easily overridden - and it is prefixed with a 'HACK ALERT'! A few others are in mllib, typically for reading schema.

I will reiterate the solutions available to users currently:
* Rely on `defaultParallelism` - this gives the expected result, unless explicitly overridden by the user.
* If you need fine-grained information about executors, use a spark listener (it is trivial to keep a count with `onExecutorAdded`/`onExecutorRemoved`; a minimal sketch follows below).
* If you simply want a current value without your own listener - use the REST api to query for current executors.

Having said this, I will caution against this approach if you are concerned about performance. `defaultParallelism` exists to give a default when the user does not explicitly override it when creating an `RDD`, and it reflects the current number of executors. Particularly when dynamic resource allocation is enabled, this value is not optimal: spark will acquire or release resources based on pending tasks. Using available cluster resources (from the cluster manager - not spark) as a way to model parallelism would be a better approach: externalize your configs and populate them based on the resources available to the application (in your example: the difference between test/staging/production).
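As a rough sketch of the listener approach mentioned above (a minimal counter, not production code; note it only sees executors added after registration):

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Thread-safe count of live executors, kept up to date by lifecycle events.
class ExecutorCountListener extends SparkListener {
  val numExecutors = new AtomicInteger(0)
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    numExecutors.incrementAndGet()
  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
    numExecutors.decrementAndGet()
}

// Usage: register once, then read the counter whenever needed.
def register(sc: SparkContext): ExecutorCountListener = {
  val listener = new ExecutorCountListener
  sc.addSparkListener(listener)
  listener
}
```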
[GitHub] spark issue #21809: [SPARK-24851] : Map a Stage ID to it's Associated Job ID...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21809 **[Test build #93247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93247/testReport)** for PR 21809 at commit [`7be0520`](https://github.com/apache/spark/commit/7be05201239c9fec7133e8f09455224786e91b59).
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748 PS: Actually, there would even be no issue with the port assignment, as Spark knows which ports it will be using, so it can create the headless service with the correct ports for the user.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user gvr commented on the issue: https://github.com/apache/spark/pull/21806 Updated description, thanks @hvanhovell
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

@echarles I don't think we should be special-casing Kubernetes here as being any different from the other cluster managers. The main point of client mode is that the driver is running locally and we make no assumptions about network connectivity. Deploying notebooks on other cluster managers which have network isolation between the driver and the executors would require the user to manually configure their own connectivity too.

There's another problem in creating the headless service here: when you create a service, you have to back that service with pods that have certain labels; that is, a label selector. However, labels are not unique for a given pod; multiple pods can have the same label key-value pairs. Now, when Spark Submit has explicit control over the pod that is created as well as the service, we make it such that the driver pod has a unique label corresponding to an application id, and then the service can be ensured to be backed by only that driver pod. If we don't control the creation of the driver pod, we can't guarantee that whatever service we create would route to only the driver pod running the application in question. It is therefore unclear what the label selector would be for the headless service we create automatically.

You could allow the user to have some control over the configuration of the headless service. For example, we could have a setting indicating the label selector the headless service should be created with. Or, the documentation can specifically require the driver pod to have a set of labels which are strictly unique to that pod in that namespace. In both cases, I would argue that we end up with the same complexity and mental burden, if not more so, than having the user deploy a headless service that gives their driver pod (and only that driver pod) a stable hostname.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748 Though I suppose you could have the driver patch its own metadata fields to assign itself a unique label. I could see that being confusing to users when they observe that their driver pod metadata is modified from what they originally deployed. Nevertheless, it's a tenable option.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21806

**[Test build #93240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93240/testReport)** for PR 21806 at commit [`68d6f19`](https://github.com/apache/spark/commit/68d6f19ad9baf78f7ffba803e7709a509063d1ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class ExprId(id: Long, jvmId: UUID)`
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93240/ Test PASSed.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21806 Merged build finished. Test PASSed.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

@mccheah If I compare with yarn-client with all nodes on the same LAN, we introduce complexity here, as the user has to ensure not only configuration but also deployment of a particular resource. I am not sure that YARN devops would be happy to be obliged to deploy, let's say, a YARN container or YARN app for each notebook before being able to use it. We do introduce complexity with k8s-client compared to yarn-client.

About selecting the pod with labels, another approach I have taken is simply using the name of the driver pod, a bit like I have done with the following deployment (so no need to ensure labels; the ports are the ports assigned by Spark, which the code can retrieve).

```
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-service
spec:
  clusterIP: None
  ports:
  - port: 7077
    name: spark-driver-port
  - port: 1
    name: spark-driver-blockmanager-port
  selector:
    run: spark-pod
```
[GitHub] spark issue #21720: [SPARK-24163][SPARK-24164][SQL] Support column list as t...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21720 LGTM Thanks! Merged to master.
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203520320

--- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
     execId: String,
-    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+    executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+    execId: String,
+    stageId: Int,
+    stageAttemptId: Int,
+    executorMetrics: Array[Long])
--- End diff --

Eh ... actually I just realized EnumMap is always mutable. If we want the public api to be totally immutable, back to a private field & constructor, and a getter. So we might as well keep the field an `Array[Long]` since it's private anyway. IMO it doesn't really matter if `ExecutorMetricType` is an enum or a sealed scala trait.
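A rough sketch of the immutable shape being discussed, with a sealed Scala hierarchy standing in for the enum (the names are illustrative, not the final SPARK-23429 API):

```scala
// Each metric type gets a fixed slot in a private backing array; the public
// API exposes only a getter, so callers never see the mutable array.
sealed abstract class ExecutorMetricType(val name: String)
case object JVMHeapMemory extends ExecutorMetricType("JVMHeapMemory")
case object JVMOffHeapMemory extends ExecutorMetricType("JVMOffHeapMemory")

object ExecutorMetricType {
  // The ordering here defines each metric's index in the backing array.
  val values: IndexedSeq[ExecutorMetricType] = IndexedSeq(JVMHeapMemory, JVMOffHeapMemory)
  private val indexOf: Map[ExecutorMetricType, Int] = values.zipWithIndex.toMap
  def index(m: ExecutorMetricType): Int = indexOf(m)
}

class ExecutorMetrics(metricValues: Array[Long]) {
  // Defensive copy keeps the public API immutable even though Array is mutable.
  private val metrics: Array[Long] = metricValues.clone()
  def getMetricValue(m: ExecutorMetricType): Long = metrics(ExecutorMetricType.index(m))
}
```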
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21589 @mridulm scheduler pools could also make the cluster-wide resource numbers not very meaningful. I don't think the maxShare work has been merged yet (kind of a stalled TODO on an open PR, IIRC), but once that is in, it's not terribly useful to know, e.g., that there are 5 million cores in the cluster if your Job is running in a scheduler pool that is restricted to using far fewer CPUs via the pool's maxShares.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

> About selecting the pod with labels, another approach I have taken is simply using the name of the driver pod, a bit like I have done with the following deployment (so no need to ensure labels - the ports are the ports assigned by spark that the code can retrieve).

I don't think you can back a service with a selector that's a pod's name, but someone with more knowledge of the Service API might be able to correct me here. I was under the impression one had to use labels. In your example, the service would match any pod with the label key of `run` being equal to `spark-pod`, which isn't guaranteed to map to a single unique pod. In spark-submit we set `spark-app-id` to a unique identifier.

> If I compare with yarn-client with all nodes on the same LAN

But if you run a YARN application with the driver not being on the same network, then the user has to set up their own connectivity. In Kubernetes that kind of networking setup happens to come up more often, perhaps, but it's not enough reason to introduce the complexity of it.

Another situation where we want the driver to not be making the headless service is in a world where the driver shouldn't have permission to create service objects, but can have permission to create pod objects. Adding a flag allowing the driver to create the headless service would implicitly change the required permissions of the application. This is more work to document and more for the application writer to consider.
[GitHub] spark pull request #21720: [SPARK-24163][SPARK-24164][SQL] Support column li...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21720
[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21729 **[Test build #93248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93248/testReport)** for PR 21729 at commit [`f9ed226`](https://github.com/apache/spark/commit/f9ed2260eb0934820bbcda2229fb7406efd5c004).
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> I don't think you can back a service with a selector that's a pod's name, but someone with more knowledge of the Service API might be able to correct me here. I was under the impression one had to use labels.

Yes, the service gets its endpoints by matching its label selector against labels on the pods, so it's critical to have matching labels. Another tenable solution is for the driver backend code to get the pod object of the driver and set the label selector of the service based on the labels on the driver pod.

> don't think we should be special-casing Kubernetes here as being any different from the other cluster managers. The main point of client mode is that the driver is running locally and we make no assumptions about network connectivity. Deploying notebooks on other cluster managers which have network isolation between the driver and the executors would require the user to manually configure their own connectivity too.

I would step back a bit and argue that Kubernetes is unique in this regard because of the need for a service for in-cluster client mode to work, and the service is a Kubernetes resource. For this reason, I think it makes sense for us to do a bit more for the users if technically it's possible to do so.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

> Yes, the service gets its endpoints by matching its label selector against labels on the pods so it's critical to have matching labels. Another tenable solution is for the driver backend code to get the pod object of the driver and sets the label selector of the service based on the labels on the driver pod.

The problem is that the driver's labels might not be unique to that driver, which therefore would require the user to assign their own unique labels or for us to patch the driver pod in-place to assign it a unique label. Neither seems ideal.
[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21729 **[Test build #93249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93249/testReport)** for PR 21729 at commit [`6316e5b`](https://github.com/apache/spark/commit/6316e5b9823dfe0b04b4521a9ef7d9516e9b8289).
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21806 The change looks fine. However, I'm wondering: do we ever have a chance to compare hash codes between expr ids from different JVMs?
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> The problem is that the driver's labels might not be unique to that driver, which therefore would require the user to assign their own unique labels or for us to patch the driver pod in-place to assign it a unique label. Neither seem ideal.

This exact problem still exists if users have to create the service themselves: they would still need to make sure to use label(s) that can tell the driver/service combination apart from other apps.
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589

> ... unless explicitly overridden by user.

This is the problem this PR addresses, actually.

> If you need fine grained information about executors, use spark listener (it is trivial to keep a count with onExecutorAdded/onExecutorRemoved)

Do you really believe this is the right approach? Instead of `spark.sparkContext.numCores`, you propose that our users figure out how to register a listener, store the results somewhere (in a thread-safe manner?), and keep them updated. Seriously?

> If you simply want a current value without own listener - use REST api to query for current executors.

I hope you know the POLA principle: https://en.wikipedia.org/wiki/Principle_of_least_astonishment . Imagine you are writing some code in a local or remote notebook in R. First of all, are you sure the REST API is available to users? Compared to a one-line call, how many lines and how much effort would calling the REST API and parsing the results take from users? Highly likely, users will just put in some constant (like you proposed, even from config) and will get an overloaded/underloaded cluster.

> defaultParallelism exists to give a default when user does not explicitly override when creating an RDD : and reflects the current number of executors.

One more time: the `numCores()` method aims to solve the problem when `defaultParallelism` is set explicitly. As I showed in the use case above, `defaultParallelism` can be changed in one part of an app (for example, in some library not available to the user) while the number of cores is needed in another part.

> externalize your config's and populate based on resources available to application

I have already described a use case in which a notebook can be attached to many clusters. The same notebook can be reused/called from other notebooks. This _"externalize your config"_ could sound perfect from a developer's perspective but not from a customer's.
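For what it's worth, a sketch of the one-liner style being argued over, using only the existing `defaultParallelism` API (`numCores` itself is the method proposed in this PR, so it is not shown here):

```scala
import org.apache.spark.sql.SparkSession

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallelism-demo").getOrCreate()
    // Closest existing substitute for the proposed sc.numCores: correct by
    // default, but wrong whenever a user or library has overridden it.
    val parts = spark.sparkContext.defaultParallelism * 2
    val df = spark.range(0, 1000000L).repartition(parts)
    println(df.rdd.getNumPartitions)
    spark.stop()
  }
}
```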
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

Got your points. About labels, right, we could take the road of the code that creates labels on its own pod. To ensure uniqueness, we could use `spark-app-id` as the key (if it meets the requirements for a label name) and `driver` as the value (let's be imaginative...).

About the permissions, this would have to be documented: I don't see an issue for a manager to give a pod permission to create services - if security prohibits this, it is up to the administrator to create them beforehand.

I just feel that we risk missing some use cases if we don't ship an option for this (the default behaviour could be the safer one, with the need to create the service manually).
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> Got you points. About labels, right, we could take the road of the code that creates labels on its own pod. To ensure uniqueness, we could use the spark-app-id as key (if it maps the requirement of a label name) and driver as value (let's be imaginative...).

Label `spark-app-id` is only set if `spark-submit` goes through the steps to create the driver pod, so it doesn't apply in this case.
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203526021

--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+
+private[r] object PrefixSpanWrapper {
+  def getPrefixSpan(
+      minSupport: Double,
+      maxPatternLength: Int,
+      maxLocalProjDBSize: Double,
--- End diff --

```maxLocalProjDBSize``` is a ```LongParam``` in Scala, but it seems that R doesn't have ```long```. See ```help(integer)```:
```
Note that on almost all implementations of R the range of representable integers is restricted to about +/-2*10^9: ‘double’s can hold much larger integers exactly.
```
so I used ```double``` in R and do ```toLong``` later.
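A minimal sketch of the Double-to-Long round trip described above, assuming the standard `PrefixSpan` setters (the wrapper shape is illustrative, not the final merged code):

```scala
import org.apache.spark.ml.fpm.PrefixSpan

// R has no 64-bit integer type, so the R side sends the value as a double;
// the Scala wrapper narrows it back to Long before setting the LongParam.
def getPrefixSpan(
    minSupport: Double,
    maxPatternLength: Int,
    maxLocalProjDBSize: Double): PrefixSpan = {
  new PrefixSpan()
    .setMinSupport(minSupport)
    .setMaxPatternLength(maxPatternLength)
    .setMaxLocalProjDBSize(maxLocalProjDBSize.toLong)
}
```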
[GitHub] spark issue #21720: [SPARK-24163][SPARK-24164][SQL] Support column list as t...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21720 @gatorsmile @maryannxue Can we move forward with this PR: https://github.com/apache/spark/pull/21699 ?
[GitHub] spark issue #21794: [SPARK-24834][CORE] use java comparison for float and do...
Github user bavardage commented on the issue: https://github.com/apache/spark/pull/21794

It does seem that Spark currently does distinguish -0.0 and 0.0, at least as far as groupBys go:
```
scala> case class Thing(x : Float)
defined class Thing

scala> val df = Seq(Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f)).toDF
2018-07-17 13:17:08 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [x: float]

scala> df.groupBy($"x").count
res0: org.apache.spark.sql.DataFrame = [x: float, count: bigint]

scala> res0.collect
res1: Array[org.apache.spark.sql.Row] = Array([-0.0,4], [0.0,4])
```
Doubles are hashed via `doubleToLongBits` (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L338), which gives different bitwise representations of positive and negative 0.
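The bit-level reason is easy to check directly (a self-contained snippet, not Spark code):

```scala
// -0.0 and 0.0 compare equal numerically but have different bit patterns,
// so a hash built on doubleToLongBits separates them into two groups.
val posBits = java.lang.Double.doubleToLongBits(0.0)   // 0x0000000000000000L
val negBits = java.lang.Double.doubleToLongBits(-0.0)  // 0x8000000000000000L
assert(0.0 == -0.0)          // numeric comparison: equal
assert(posBits != negBits)   // bitwise hash input: different
```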
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

> Label spark-app-id is only set if spark-submit goes through the steps to create the driver pod so doesn't apply in this case.

In that case, the client process could create its own `spark-client-app-id`...
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> In that case, the client process could create its own spark-client-app-id...

Yes, and that's what my point above is about. Regardless of how the driver pod is created and managed, users need to make sure it contains a label that is unique for the pod/service combination to work.
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21533 **[Test build #4220 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4220/testReport)** for PR 21533 at commit [`eb46ccf`](https://github.com/apache/spark/commit/eb46ccfec084c2439a26eee38015381f091fe164).
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/21806 @viirya this current change is only useful when you compare canonicalized plans created on different JVMs. This has come up when we tried to detect changes in plans over spark versions (plan stability) and when people are writing to caches used by multiple clusters.
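A minimal sketch of the shape of the change (not necessarily the exact merged code): keep `jvmId` in the case class for equality, but hash only the JVM-independent `id`.

```scala
import java.util.UUID

case class ExprId(id: Long, jvmId: UUID) {
  // Hash only the counter-assigned id, so two JVMs that canonicalize the
  // same plan produce the same hash; case-class equality still includes jvmId.
  override def hashCode(): Int = id.hashCode()
}
```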
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

Taking a step back, I think it seems unwise to make any assumptions about the location in which a driver is running in client mode. Client mode is simply saying that the application driver is running "locally", wherever locally is. That is why I suggested also removing the driver's knowledge of the driver pod name and removing the owner reference concept entirely. As soon as we start to make assumptions in one place, it may (though not necessarily) encourage us to rely on those assumptions to create more forks and complexities in the code in other places. That's a precedent we likely do not want to set.

> Yes, and that's what my point above is about. Regardless of how the driver pod is created and managed, users need to make sure it contains a label that is unique for the pod/service combination to work.

In one case the requirement for the unique label is explicit - the user has created their service and their pod manually and has assigned the labels accordingly. In the other case, the user must know to assign the unique label, but they would only know to do so from having read the documentation on deploying Spark. Explicit requirements that are well defined by the API are preferable to implicit requirements.
[GitHub] spark issue #21794: [SPARK-24834][CORE] use java comparison for float and do...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21794 `spark-sql` suggests that -0 and 0 are considered the same though. `SELECT -0.0 == 0.0;` returns `true`. It's probably essential not to change behavior here, but if performance is the issue, I think the method can be optimized.
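For reference, the tension between the primitive comparison and the Java comparison the PR title refers to can be seen in two lines (plain Scala, not Spark code):

```scala
// Primitive == treats -0.0 and 0.0 as equal; java.lang.Double.compare
// uses a total order in which -0.0 sorts strictly before 0.0.
assert(-0.0 == 0.0)
assert(java.lang.Double.compare(-0.0, 0.0) < 0)
```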
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> That is why I suggested also to remove the driver's knowledge of the driver pod name and to remove the owner reference concept entirely.

Well, not worrying about the driver pod name and owner reference doesn't eliminate the need for a headless service if the driver is indeed running in a pod. So the question is whether the k8s scheduler backend should be responsible for creating the service if the driver is running in a pod. I'm personally leaning towards doing that for the user.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

> I'm personally leaning towards doing that for the user.

Especially if the user is a data scientist behind a notebook, launching a paragraph which is supposed to instantiate a Spark REPL as fast as possible, without having to call the support center or launch a script. And if a script were needed, we'd better ship the logic in the core of Spark.
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user bomeng commented on the issue: https://github.com/apache/spark/pull/21638 Either way works for me, but since this is not a private method, people may use it in their own code. The minimal change will be the best.
[GitHub] spark issue #21807: [SPARK-24536] Validate that limit clause cannot have a n...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/21807 @NiharS yeah, that makes sense. @mauropalsgraaf we missed this today (sorry about that). Can you add the null check (bonus points if you call `eval()` only once) and add a test for this case?
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21638 Except for `binaryFiles`, everything else that needs to change is private to Spark. I know it's public in the bytecode, but only Java callers could accidentally exploit that. Still I don't personally care too much either way, as long as all the unused args are documented, I guess, for completeness.
[GitHub] spark issue #21582: [SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/21582 Thank you so much, @gatorsmile . I will proceed. Also, thank you, @viirya and @maropu .
[GitHub] spark issue #21202: [SPARK-24129] [K8S] Add option to pass --build-arg's to ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21202 Merged to master
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21451 Merged build finished. Test PASSed.
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21451 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1105/ Test PASSed.
[GitHub] spark pull request #21202: [SPARK-24129] [K8S] Add option to pass --build-ar...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21202
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21451 **[Test build #93250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93250/testReport)** for PR 21451 at commit [`335e26d`](https://github.com/apache/spark/commit/335e26d168dc99e7317175da8732ff691ff512f2).
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21451 @mridulm @jerryshao @felixcheung last one in the 2GB block limit series. Just rebased to include the updates to https://github.com/apache/spark/pull/21440. I will also run my tests on a cluster here with this: https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala and will report the results from that, probably tomorrow. Thanks for all the reviews!
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21546 Thanks @HyukjinKwon! Any additional comments @holdenk @sethah @viirya @felixcheung ?
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18784 Let's remove it in 3.0 then. We can do it after the 2.4 release.
[GitHub] spark issue #21808: [SPARK-21261][DOCS][SQL] SQL Regex document fix
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21808

**[Test build #93241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93241/testReport)** for PR 21808 at commit [`a444e80`](https://github.com/apache/spark/commit/a444e80183c83c3eb8df1a5527f802506c8d3bc0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21808: [SPARK-21261][DOCS][SQL] SQL Regex document fix
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21808 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93241/ Test PASSed.
[GitHub] spark issue #21808: [SPARK-21261][DOCS][SQL] SQL Regex document fix
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21808 Merged build finished. Test PASSed.
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 @MaxGekk We are going in circles. I don't think this is a good api to expose currently - the data is available through multiple other means, as I detailed, and while not a succinct one-liner, it is usable. Not to mention @markhamstra's comment. Unless there is some other compelling reason for introducing this which I have missed, I am -1 on introducing this change.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21806 @hvanhovell Got it. Thanks for your explanation. LGTM.
[GitHub] spark issue #21761: [SPARK-24771][BUILD]Upgrade Apache AVRO to 1.8.2
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/21761 Hi, @gatorsmile . Could you review this PR? I'm wondering if we can have this for Spark 2.4 before branch-cut.
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203538149

--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+
+private[r] object PrefixSpanWrapper {
+  def getPrefixSpan(
+      minSupport: Double,
+      maxPatternLength: Int,
+      maxLocalProjDBSize: Double,
--- End diff --

Ah, I see. Thanks.
[GitHub] spark pull request #21810: [SPARK-24854][SQL] Gathering all Avro options int...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21810

[SPARK-24854][SQL] Gathering all Avro options into the AvroOptions class

## What changes were proposed in this pull request?

In the PR, I propose to put all `Avro` options in the new class `AvroOptions`, in the same way as for the other datasources `JSON` and `CSV`.

## How was this patch tested?

It was tested by `AvroSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 avro-options

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21810.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21810

commit 9da3a98d4e68697a0f4ab2b21be4b6aebeeb25e3
Author: Maxim Gekk
Date: 2018-07-18T18:50:37Z

    Adding a class for Avro Options

commit 3a76ba2de635fb6f653ee1d814ed0b87bcb12fb6
Author: Maxim Gekk
Date: 2018-07-18T21:37:08Z

    Gathering all options in AvroOptions
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21810 **[Test build #93251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93251/testReport)** for PR 21810 at commit [`3a76ba2`](https://github.com/apache/spark/commit/3a76ba2de635fb6f653ee1d814ed0b87bcb12fb6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21798: [SPARK-24836][SQL] New option for Avro datasource - igno...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21798 Please look at this PR: https://github.com/apache/spark/pull/21810 . It introduces `AvroOptions`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21810 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589
> it's not terribly useful to know, e.g., that there are 5 million cores in the cluster if your Job is running in a scheduler pool that is restricted to using far fewer CPUs via the pool's maxShares
Is `defaultParallelism` more useful in that case?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user ssimeonov commented on the issue: https://github.com/apache/spark/pull/21589
@mridulm your comments make an implicit assumption, which is quite incorrect: that Spark users read the Spark codebase and/or are aware of Spark internals. Please consider this PR in the context of its intended audience, who (a) do not read the source code and (b) hardly look at the API docs. What they read are things like Stack Overflow, the Databricks Guide, blog posts and (quite rarely) the occasional how-to-with-Spark book. The fact that something is possible with Spark doesn't make it easy or intuitive. The value of this PR is that it makes a common use case easy and intuitive. Let's consider the practicality of your suggestions:
> Rely on defaultParallelism - this gives the expected result, unless explicitly overridden by user.
That doesn't address the core use case, as the scope of change & effect is very different. In the targeted use cases, a user wants to explicitly control the level of parallelism relative to the current cluster physical state, potentially for a single stage. Relying on `defaultParallelism` exposes the user to undesired side effects, as the setting can be changed by other, potentially unrelated code the user has no control over. Introducing unintended side effects, which your suggestion does, is poor design.
> If you need fine grained information about executors, use spark listener (it is trivial to keep a count with onExecutorAdded/onExecutorRemoved).
I'd suggest you reconsider your definition of "trivial". Normal Spark users, not people who work on Spark or at companies like Hortonworks whose job is to be Spark experts, have no idea what a listener is, have never hooked one and never will. Not to mention how much fun it is to do this from, say, R.
> If you simply want a current value without own listener - use REST api to query for current executors.
This type of suggestion is a prime example of ignoring Spark user concerns. You are comparing `sc.numExecutors` with:
1. Knowing that a REST API exists that can produce this result.
2. Learning the details of the API.
3. Picking a synchronous REST client in the language they are using Spark with.
4. Initializing the REST client with the correct endpoint, which they obtain... somehow.
5. Formulating the request.
6. Parsing the response.
I don't think there is any need to say more about this suggestion.
Taking a step back, it is important to acknowledge that Spark has become a mass-market data platform product and to start designing user-facing APIs with this in mind. If the teams I know are any indication, the majority of Spark users are not experienced backend/data engineers. They are data scientists and data hackers: people who are getting into big data via Spark. The imbalance is only going to grow. The criteria by which user-focused Spark APIs are evaluated should evolve accordingly. From an ease-of-use perspective, I'd argue the two new methods should also be exposed on `SparkSession`, as this is the typical new-user "entry point". For example, the data scientists on my team never use `SparkContext`, but they do adjust stage parallelism via implicits equivalent to the ones proposed in this PR (to significant benefit in query execution performance).
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
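[Editor's note: for a sense of what the listener suggestion involves in practice, here is a minimal sketch using Spark's public `SparkListener` callbacks. The class name and wiring are illustrative assumptions, not code from the PR or from either comment.]

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Illustrative listener that keeps a live count of registered executors.
class ExecutorCountListener extends SparkListener {
  private val count = new AtomicInteger(0)

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    count.incrementAndGet()

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    count.decrementAndGet()

  def numExecutors: Int = count.get()
}

// Wiring, given an existing SparkContext `sc`:
//   val listener = new ExecutorCountListener
//   sc.addSparkListener(listener)
//   ... later: listener.numExecutors
```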
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21803 **[Test build #93242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93242/testReport)** for PR 21803 at commit [`f302777`](https://github.com/apache/spark/commit/f302777461e258e59e3158ec44cd78a43ae8758c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21803 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93242/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21810 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93251/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21810 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21810 **[Test build #93251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93251/testReport)** for PR 21810 at commit [`3a76ba2`](https://github.com/apache/spark/commit/3a76ba2de635fb6f653ee1d814ed0b87bcb12fb6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21796: [SPARK-24833][K8S][WIP] Add host name aliases feature
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21796 I think we decided not to take any new configuration options with https://issues.apache.org/jira/browse/SPARK-24434 being worked on. @mccheah @foxish. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21795 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21584: [SPARK-24433][K8S] Initial R Bindings for SparkR on K8s
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/21584
just wanted to briefly chime in. TL;DR: this build will fail until the PRB is running on our ubuntu build nodes. we are currently blocked from testing this stuff w/the current iteration of the PRB on our jenkins for a couple of reasons:
1) the current pull request builder runs only on our old, ancient and crufty centos 6.9 hosts
2) these hosts have old, ancient and crufty installations of R (3.1.1) and associated libraries
3) trying to re-create (2) on our shiny, new ubuntu 16.04 boxen has, so far, been an exercise in futility. :\
4) this is why we currently have a limited number of ubuntu boxes, and also why things like testing for spark/k8s integration require an additional build *on top* of the standard PRB build to make things work.
i'm working now w/some folks at databricks (@weshsiao) to get this under control enough so that we can get things moving to the ubuntu hosts ASAP.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21589
No, defaultParallelism isn't more useful in that case, but that just starts getting to my overall assessment of this JIRA and PR: it smells of defining the problem to align with a preconception of the solution. Exposing the driver's current accounting of the number of cores active in the cluster is not something that we couldn't do or didn't know how to do a long time ago. Rather, it is something that those of us working on the scheduler chose not to do, because of the expectation that putting this in the public API (and thereby implicitly encouraging its use) was likely to produce as many problems as it solves. This was primarily because of two factors: 1) the number of cores and executors is not static; 2) closely tailoring a Job to some expectation of the number of available cores or executors is not obviously a correct thing to encourage in general.
Whether from node failures, dynamic executor allocation, backend scheduler elasticity/preemption, or just other Jobs running under the same SparkContext, the number of cores and executors available to any particular Job when it is created can easily be different from what is available when any of its Stages actually runs. Even if you could get reliable numbers for the cores and executors that will be available through the lifecycle of a Job, tailoring a Job to use all of those cores and executors is only the right thing to do in a subset of Spark use cases. For example, using many more executors than there are DFS partitions holding the data, trying to use all of the cores when there are other Jobs pending, trying to use all of the cores when another Job needs to acquire a minimum number for barrier scheduled execution, or trying to use more cores than a scheduling pool permits would all be examples of anti-patterns that would be enabled by easy, context-free access to low-level numCores.
There definitely are use cases where users need to be able to set policy for whether particular jobs should be encouraged to use more or less of the cluster's resources, but I believe that that needs to be done at a much higher level of abstraction, in a declarative form, and that policy likely needs to be enforced dynamically/adaptively at Stage boundaries. The under-developed and under-used dynamic shuffle partitioning code in Spark SQL starts to go in that direction.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user ssimeonov commented on the issue: https://github.com/apache/spark/pull/21589
@markhamstra the purpose of this PR is not to address the topic of dynamic resource management in arbitrarily complex Spark environments. Most Spark users do not operate in such environments. It is to help simple Spark users refactor code such as
```scala
df.repartition(25) // and related repartition() + coalesce() variants
```
to make job execution take advantage of additional cores, when they are available. Asking for a greater degree of parallelism than the cores a job has available rarely has significant negative effects (for reasonable values). Asking for a low degree of parallelism when there are lots of cores available has significant negative effects, especially in the common real-world use cases where there is lots of data skew. That's the point that both you and @mridulm seem to be missing. The arguments about resources flexing during job execution do not change this. My team has used this simple technique for years on both static and autoscaling clusters, and we've seen meaningful performance improvements in both ETL and ML/AI-related data production for data ranging from gigabytes to petabytes. The idea is simple enough that even data scientists can (and do) easily use it. That's the benefit of this PR and that's why I like it. The cost of this PR is adding two simple & clear methods. The cost-benefit analysis seems obvious.
I agree with you that lots more can be done to handle the general case of better matching job resource needs to cluster/pool resources. This work is going to take forever given the current priorities. Let's not deny the majority of Spark users simple & real execution benefits while we dream about amazing architectural improvements. When looking at the net present value of performance, the discount factor is large. Performance improvements now are a lot more valuable than performance improvements in the far future.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
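[Editor's note: the refactoring described above, replacing a hard-coded count like `repartition(25)` with one derived from cluster size, can be sketched as follows. `repartitionPerCore` is a hypothetical helper; `defaultParallelism`, which exists today, stands in for the methods the PR proposes.]

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: scale a stage's parallelism to the cluster's current
// size rather than a hard-coded constant. On non-local masters,
// defaultParallelism is typically the total core count, so `factor` oversizes
// it to smooth over data skew.
def repartitionPerCore(df: DataFrame, factor: Int = 2): DataFrame =
  df.repartition(factor * df.sparkSession.sparkContext.defaultParallelism)

// Usage, given an existing DataFrame `df`:
//   val balanced = repartitionPerCore(df)
```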
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21589 @ssimeonov the purpose of a public API is not to offer hack solutions to a subset of problems. What is needed is a high-level, declarative abstraction that can be used to specify requested Job resource-usage policy. Exposing low-level Spark scheduling internals is not the way to achieve that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21795 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1106/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21804 **[Test build #93252 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93252/testReport)** for PR 21804 at commit [`5ffc793`](https://github.com/apache/spark/commit/5ffc793e8fadc0c7c8a1137d227c187f88be6ef5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user ssimeonov commented on the issue: https://github.com/apache/spark/pull/21589
@markhamstra even the words you are using indicate that you are missing the intended audience.
> high-level, declarative abstraction that can be used to specify requested Job resource-usage policy
How exactly do you imagine data scientists using something like this as they hack in a Jupyter or Databricks notebook in Python to sample data from a 10 TB dataset?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org