[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 Thanks. I rebased it.
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21533

**[Test build #4219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4219/testReport)** for PR 21533 at commit [`eb46ccf`](https://github.com/apache/spark/commit/eb46ccfec084c2439a26eee38015381f091fe164).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.
[GitHub] spark issue #21807: [SPARK-24536] Validate that limit clause cannot have a n...
Github user NiharS commented on the issue: https://github.com/apache/spark/pull/21807

New to SQL, but it seems like the query `SELECT 1 LIMIT CAST('1' AS INT)` should work, right? I tried it both on Spark without your change and in the W3Schools SQL tester, and it's accepted in both, but on your PR it hits the new AnalysisException. If this is indeed an issue, it can be avoided by having the case be `e.eval() == null` instead of `e.nullable`, although there's some duplicated work since the previous case also calls `e.eval()`.
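For illustration, a minimal sketch of the suggested check, assuming a hypothetical `checkLimitClause` helper and a local stand-in for `CheckAnalysis.failAnalysis` (this is not the exact `CheckAnalysis` code):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative stand-in for CheckAnalysis.failAnalysis
def failAnalysis(msg: String): Nothing = throw new IllegalStateException(msg)

// Evaluate the foldable limit expression once and reject only a null
// result, so a foldable cast like CAST('1' AS INT) still passes.
def checkLimitClause(limitExpr: Expression): Unit = {
  if (!limitExpr.foldable) {
    failAnalysis("The limit expression must evaluate to a constant value")
  }
  limitExpr.eval() match { // single eval() call, reused for all checks
    case null =>
      failAnalysis("The evaluated limit expression must not be null")
    case v: Int if v < 0 =>
      failAnalysis(s"The limit expression must be equal to or greater than 0, but got $v")
    case _ => // valid constant limit
  }
}
```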
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1104/ Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test PASSed.
[GitHub] spark issue #21447: [SPARK-24339][SQL]Add project for transform/map/reduce s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21447 @maropu Do you want to take this over and add such a project in `ColumnPruning`?
[GitHub] spark issue #21656: [SPARK-24677][Core]Avoid NoSuchElementException from Med...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21656 merged thanks @cxzl25
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #93246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93246/testReport)** for PR 20146 at commit [`c003bd3`](https://github.com/apache/spark/commit/c003bd3d6c58cf19249ff0ba9dd10140971d655c).
[GitHub] spark pull request #21635: [SPARK-24594][YARN] Introducing metrics for YARN
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/21635#discussion_r203495430

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterSource.scala ---
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn
+
+import com.codahale.metrics.{Gauge, MetricRegistry}
+
+import org.apache.spark.metrics.source.Source
+
+private[spark] class ApplicationMasterSource(yarnAllocator: YarnAllocator) extends Source {
+
+  override val sourceName: String = "applicationMaster"
--- End diff --

The config spark.metrics.namespace already exists; see the metrics section in http://spark.apache.org/docs/latest/monitoring.html. But if you look at the code https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L129, it's only applied for executor and driver metrics. I think we should have it apply to the YARN metrics as well.
[GitHub] spark issue #21468: [SPARK-22151] : PYTHONPATH not picked up from the spark....
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21468 merged thanks @pgandhi999
[GitHub] spark pull request #21742: [SPARK-24768][SQL] Have a built-in AVRO data sour...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21742#discussion_r203496489

--- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+package object avro {
+  /**
+   * Adds a method, `avro`, to DataFrameWriter that allows you to write avro files using
+   * the DataFileWriter
+   */
+  implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) {
+    def avro: String => Unit = writer.format("avro").save
--- End diff --

I'd remove the Scala API.
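For reference, without the implicit, a caller would use the generic writer API directly. A minimal sketch, assuming the Avro module is on the classpath so the short format name `"avro"` resolves (app name and output path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-demo").getOrCreate()
val df = spark.range(0, 10).toDF("id")

// Equivalent call without the AvroDataFrameWriter implicit:
df.write.format("avro").save("/tmp/avro-demo-output")
```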
[GitHub] spark pull request #21468: [SPARK-22151] : PYTHONPATH not picked up from the...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21468
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 Yeah, looks like the AppVeyor tests are triggered.
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user edwinalu commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203503691

--- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
     execId: String,
-    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+    executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+    execId: String,
+    stageId: Int,
+    stageAttemptId: Int,
+    executorMetrics: Array[Long])
--- End diff --

This sounds like a good solution, with both a clean API and an efficient implementation that will be easier to add new metrics to. Thanks!
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21710

**[Test build #93245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93245/testReport)** for PR 21710 at commit [`ec88d38`](https://github.com/apache/spark/commit/ec88d38f3bafb9740c7c9a3856ddc3a64cb16fe7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21795

**[Test build #93239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93239/testReport)** for PR 21795 at commit [`c83eeeb`](https://github.com/apache/spark/commit/c83eeebc32d449412cc53951ad5b9611ac5f3268).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21710 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93245/ Test PASSed.
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21710 Merged build finished. Test PASSed.
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21795 Merged build finished. Test PASSed.
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21795 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93239/ Test PASSed.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

> Note that we only invoke any of the feature steps and the entry point of KubernetesClientApplication if we run in cluster mode. If we run in client mode, we enter directly into the user's main class, or, the user is in a process that just created a Spark context from scratch with the right master URL (i.e. new SparkContext(k8s://my-master:8443)). If you wanted to create the headless service in client mode, you'd have to do it when instantiating the KubernetesClusterSchedulerBackend somewhere, probably before creating the ExecutorPodsAllocator so that you'd set spark.driver.host and spark.driver.port properly when telling the created executors where to find the driver via the --driver-url argument.

For Out-Cluster we don't need any headless service, but for In-Cluster the user has to ensure that the driver is accessible via a service (headless or not). I understand it may be preferable to align with Spark's other managers and to have coherent behaviour for k8s in/out cluster, but I still believe that we have to make life easy for users and devops. The notebook case is self-explanatory: for Zeppelin, you may have a single pod (web server) that will create multiple client sessions (this is a common, supported scenario), and Jupyter with Toree may be configured to support multiple REPLs (and if not today, it will come one day). You also have the case of the current Livy, which is a single REST server managing a pool of REPLs. With the current PR, we push onto all those third-party consumers the responsibility to create services, plus all the hard work of assigning and maintaining ports to avoid collisions on a single pod with multiple drivers.

What about having a property that would enable/disable that headless service creation (it could be disabled by default)? This is not yet the perfect solution, as the user would still have to pass the ports and ensure there is no clash.
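For context, a minimal sketch of what in-cluster client mode looks like from the user side under the behavior being discussed: a pre-created headless service and hand-picked ports (the service name, namespace, master URL, and ports below are all hypothetical placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumes "spark-driver-service" is a headless service the user (or the
// notebook operator) created beforehand; the ports must match that service.
val conf = new SparkConf()
  .setAppName("notebook-repl")
  .setMaster("k8s://https://my-master:8443")
  .set("spark.driver.host", "spark-driver-service.my-namespace.svc")
  .set("spark.driver.port", "7077")
  .set("spark.driver.blockManager.port", "7078")
val sc = new SparkContext(conf)
```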
[GitHub] spark issue #21809: [SPARK-24851] : Map a Stage ID to it's Associated Job ID...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21809 ok to test
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589

@MaxGekk The example you cite is literally one of a handful of usages which is not easily overridden - and it is prefixed with a 'HACK ALERT'! A few others are in mllib, typically for reading schema.

I will reiterate the solutions available to users currently:
* Rely on `defaultParallelism` - this gives the expected result, unless explicitly overridden by the user.
* If you need fine-grained information about executors, use a spark listener (it is trivial to keep a count with `onExecutorAdded`/`onExecutorRemoved`; a minimal sketch follows below).
* If you simply want a current value without your own listener - use the REST api to query for current executors.

Having said this, I will caution against this approach if you are concerned about performance. `defaultParallelism` exists to give a default when the user does not explicitly override it when creating an `RDD`, and it reflects the current number of executors. Particularly when dynamic resource allocation is enabled, this value is not optimal: spark will acquire or release resources based on pending tasks. Using available cluster resources (from the cluster manager - not spark) as a way to model parallelism would be a better approach: externalize your configs and populate them based on the resources available to the application (in your example: the difference between test/staging/production).
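As a rough sketch of the listener approach mentioned above (a minimal counter, not production code; note it only sees executors added after registration):

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Thread-safe count of live executors, kept up to date by lifecycle events.
class ExecutorCountListener extends SparkListener {
  val numExecutors = new AtomicInteger(0)
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    numExecutors.incrementAndGet()
  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
    numExecutors.decrementAndGet()
}

// Usage: register once, then read the counter whenever needed.
def register(sc: SparkContext): ExecutorCountListener = {
  val listener = new ExecutorCountListener
  sc.addSparkListener(listener)
  listener
}
```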
[GitHub] spark issue #21809: [SPARK-24851] : Map a Stage ID to it's Associated Job ID...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21809 **[Test build #93247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93247/testReport)** for PR 21809 at commit [`7be0520`](https://github.com/apache/spark/commit/7be05201239c9fec7133e8f09455224786e91b59).
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748 PS: Actually, there would even be no issue with the port assignment, as Spark knows which ports it will be using, so it can create the headless service with the correct ports for the user.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user gvr commented on the issue: https://github.com/apache/spark/pull/21806 Updated description, thanks @hvanhovell
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

@echarles I don't think we should be special-casing Kubernetes here as being any different from the other cluster managers. The main point of client mode is that the driver is running locally and we make no assumptions about network connectivity. Deploying notebooks on other cluster managers which have network isolation between the driver and the executors would require the user to manually configure their own connectivity too.

There's another problem in creating the headless service here: when you create a service, you have to back that service with pods that have certain labels; that is, a label selector. However, labels are not unique for a given pod; multiple pods can have the same label key-value pairs. Now, when Spark Submit has explicit control over the pod that is created as well as the service, we make it such that the driver pod has a unique label corresponding to an application id, and then the service can be ensured to be backed by only that driver pod. If we don't control the creation of the driver pod, we can't guarantee that whatever service we create would route to only the driver pod running the application in question. It is therefore unclear what the label selector would be for the headless service we create automatically.

You could allow the user to have some control over the configuration of the headless service. For example, we could have a setting indicating the label selector the headless service should be created with. Or, the documentation can specifically require the driver pod to have a set of labels which are strictly unique to that pod in that namespace. In both cases, I would argue that we end up with the same complexity and mental burden, if not more so, than having the user deploy a headless service that gives their driver pod (and only that driver pod) a stable hostname.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748 Though I suppose you could have the driver patch its own metadata fields to assign itself a unique label. I could see that being confusing to users when they observe that their driver pod metadata is modified from what they originally deployed. Nevertheless, it's a tenable option.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21806

**[Test build #93240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93240/testReport)** for PR 21806 at commit [`68d6f19`](https://github.com/apache/spark/commit/68d6f19ad9baf78f7ffba803e7709a509063d1ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class ExprId(id: Long, jvmId: UUID)`
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93240/ Test PASSed.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21806 Merged build finished. Test PASSed.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

@mccheah If I compare with yarn-client with all nodes on the same LAN, we introduce complexity here, as the user has to ensure not only configuration but also deployment of a particular resource. I am not sure that YARN devops would be happy to be obliged to deploy, let's say, a YARN container or YARN app for each notebook before being able to use it. We do introduce complexity with k8s-client compared to yarn-client.

About selecting the pod with labels, another approach I have taken is simply using the name of the driver pod, a bit like I have done with the following deployment (so no need to ensure labels; the ports are the ports assigned by Spark, which the code can retrieve).

```
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-service
spec:
  clusterIP: None
  ports:
  - port: 7077
    name: spark-driver-port
  - port: 1
    name: spark-driver-blockmanager-port
  selector:
    run: spark-pod
```
[GitHub] spark issue #21720: [SPARK-24163][SPARK-24164][SQL] Support column list as t...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21720 LGTM Thanks! Merged to master.
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203520320

--- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo) extends
  * Periodic updates from executors.
  * @param execId executor id
  * @param accumUpdates sequence of (taskId, stageId, stageAttemptId, accumUpdates)
+ * @param executorUpdates executor level metrics updates
  */
 @DeveloperApi
 case class SparkListenerExecutorMetricsUpdate(
     execId: String,
-    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])])
+    accumUpdates: Seq[(Long, Int, Int, Seq[AccumulableInfo])],
+    executorUpdates: Option[Array[Long]] = None)
+  extends SparkListenerEvent
+
+/**
+ * Peak metric values for the executor for the stage, written to the history log at stage
+ * completion.
+ * @param execId executor id
+ * @param stageId stage id
+ * @param stageAttemptId stage attempt
+ * @param executorMetrics executor level metrics, indexed by MetricGetter.values
+ */
+@DeveloperApi
+case class SparkListenerStageExecutorMetrics(
+    execId: String,
+    stageId: Int,
+    stageAttemptId: Int,
+    executorMetrics: Array[Long])
--- End diff --

Eh ... actually I just realized EnumMap is always mutable. If we want the public api to be totally immutable, back to a private field & constructor, and a getter. So we might as well keep the field an `Array[Long]` since it's private anyway. IMO it doesn't really matter if `ExecutorMetricType` is an enum or a sealed scala trait.
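A rough sketch of the immutable shape being discussed, with a sealed Scala hierarchy standing in for the enum (the names are illustrative, not the final SPARK-23429 API):

```scala
// Each metric type gets a fixed slot in a private backing array; the public
// API exposes only a getter, so callers never see the mutable array.
sealed abstract class ExecutorMetricType(val name: String)
case object JVMHeapMemory extends ExecutorMetricType("JVMHeapMemory")
case object JVMOffHeapMemory extends ExecutorMetricType("JVMOffHeapMemory")

object ExecutorMetricType {
  // The ordering here defines each metric's index in the backing array.
  val values: IndexedSeq[ExecutorMetricType] = IndexedSeq(JVMHeapMemory, JVMOffHeapMemory)
  private val indexOf: Map[ExecutorMetricType, Int] = values.zipWithIndex.toMap
  def index(m: ExecutorMetricType): Int = indexOf(m)
}

class ExecutorMetrics(metricValues: Array[Long]) {
  // Defensive copy keeps the public API immutable even though Array is mutable.
  private val metrics: Array[Long] = metricValues.clone()
  def getMetricValue(m: ExecutorMetricType): Long = metrics(ExecutorMetricType.index(m))
}
```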
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21589 @mridulm scheduler pools could also make the cluster-wide resource numbers not very meaningful. I don't think the maxShare work has been merged yet (kind of a stalled TODO on an open PR, IIRC), but once that is in, it's not terribly useful to know, e.g., that there are 5 million cores in the cluster if your Job is running in a scheduler pool that is restricted to using far fewer CPUs via the pool's maxShares.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

> About selecting the pod with labels, another approach I have taken is simply using the name of the driver pod, a bit like I have done with the following deployment (so no need to ensure labels - the ports are the ports assigned by spark that the code can retrieve).

I don't think you can back a service with a selector that's a pod's name, but someone with more knowledge of the Service API might be able to correct me here. I was under the impression one had to use labels. In your example, the service would match any pod with the label key of `run` being equal to `spark-pod`, which isn't guaranteed to map to a single unique pod. In spark-submit we set `spark-app-id` to a unique identifier.

> If I compare with yarn-client with all nodes on the same LAN

But if you run a YARN application with the driver not being on the same network, then the user has to set up their own connectivity. In Kubernetes that kind of networking setup happens to come up more often, perhaps, but it's not enough reason to introduce the complexity of it.

Another situation where we want the driver to not be making the headless service is in a world where the driver shouldn't have permission to create service objects, but can have permission to create pod objects. Adding a flag allowing the driver to create the headless service would implicitly change the required permissions of the application. This is more work to document and more for the application writer to consider.
[GitHub] spark pull request #21720: [SPARK-24163][SPARK-24164][SQL] Support column li...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21720
[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21729 **[Test build #93248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93248/testReport)** for PR 21729 at commit [`f9ed226`](https://github.com/apache/spark/commit/f9ed2260eb0934820bbcda2229fb7406efd5c004).
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> I don't think you can back a service with a selector that's a pod's name, but someone with more knowledge of the Service API might be able to correct me here. I was under the impression one had to use labels.

Yes, the service gets its endpoints by matching its label selector against labels on the pods, so it's critical to have matching labels. Another tenable solution is for the driver backend code to get the pod object of the driver and set the label selector of the service based on the labels on the driver pod.

> don't think we should be special-casing Kubernetes here as being any different from the other cluster managers. The main point of client mode is that the driver is running locally and we make no assumptions about network connectivity. Deploying notebooks on other cluster managers which have network isolation between the driver and the executors would require the user to manually configure their own connectivity too.

I would step back a bit and argue that Kubernetes is unique in this regard because of the need for a service for in-cluster client mode to work, and the service is a Kubernetes resource. For this reason, I think it makes sense for us to do a bit more for the users if technically it's possible to do so.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

> Yes, the service gets its endpoints by matching its label selector against labels on the pods so it's critical to have matching labels. Another tenable solution is for the driver backend code to get the pod object of the driver and sets the label selector of the service based on the labels on the driver pod.

The problem is that the driver's labels might not be unique to that driver, which therefore would require the user to assign their own unique labels or for us to patch the driver pod in-place to assign it a unique label. Neither seems ideal.
[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21729 **[Test build #93249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93249/testReport)** for PR 21729 at commit [`6316e5b`](https://github.com/apache/spark/commit/6316e5b9823dfe0b04b4521a9ef7d9516e9b8289).
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21806 The change looks fine. However, I'm wondering: do we ever have a chance to compare hash codes between expr ids from different JVMs?
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> The problem is that the driver's labels might not be unique to that driver, which therefore would require the user to assign their own unique labels or for us to patch the driver pod in-place to assign it a unique label. Neither seem ideal.

This exact problem still exists if users have to create the service themselves: they would still need to make sure to use label(s) that can tell the driver/service combination apart from other apps.
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589

> ... unless explicitly overridden by user.

This is the problem this PR addresses, actually.

> If you need fine grained information about executors, use spark listener (it is trivial to keep a count with onExecutorAdded/onExecutorRemoved)

Do you really believe this is the right approach? Instead of `spark.sparkContext.numCores`, you propose that our users figure out how to register a listener, store the results somewhere (in a thread-safe manner?), and keep them updated. Seriously?

> If you simply want a current value without own listener - use REST api to query for current executors.

I hope you know the POLA principle: https://en.wikipedia.org/wiki/Principle_of_least_astonishment . Imagine you are writing some code in a local or remote notebook in R. First of all, are you sure the REST API is available to users? Compared to a one-line call, how many lines and how much effort would calling the REST API and parsing the results take from users? Highly likely, users will just put in some constant (like you proposed, even from config) and will get an overloaded/underloaded cluster.

> defaultParallelism exists to give a default when user does not explicitly override when creating an RDD : and reflects the current number of executors.

One more time: the `numCores()` method aims to solve the problem when `defaultParallelism` is set explicitly. As I showed in the use case above, `defaultParallelism` can be changed in one part of an app (for example, in some library not available to the user) while the number of cores is needed in another part.

> externalize your config's and populate based on resources available to application

I have already described a use case in which a notebook can be attached to many clusters. The same notebook can be reused/called from other notebooks. This _"externalize your config"_ could sound perfect from a developer's perspective but not from a customer's.
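For what it's worth, a sketch of the one-liner style being argued over, using only the existing `defaultParallelism` API (`numCores` itself is the method proposed in this PR, so it is not shown here):

```scala
import org.apache.spark.sql.SparkSession

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallelism-demo").getOrCreate()
    // Closest existing substitute for the proposed sc.numCores: correct by
    // default, but wrong whenever a user or library has overridden it.
    val parts = spark.sparkContext.defaultParallelism * 2
    val df = spark.range(0, 1000000L).repartition(parts)
    println(df.rdd.getNumPartitions)
    spark.stop()
  }
}
```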
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

Got your points. About labels, right, we could take the road of the code that creates labels on its own pod. To ensure uniqueness, we could use `spark-app-id` as the key (if it meets the requirements for a label name) and `driver` as the value (let's be imaginative...).

About the permissions, this would have to be documented: I don't see an issue for a manager to give a pod permission to create services - if security prohibits this, it is up to the administrator to create them beforehand.

I just feel that we risk missing some use cases if we don't ship an option for this (the default behaviour could be the safer one, with the need to create the service manually).
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> Got you points. About labels, right, we could take the road of the code that creates labels on its own pod. To ensure uniqueness, we could use the spark-app-id as key (if it maps the requirement of a label name) and driver as value (let's be imaginative...).

Label `spark-app-id` is only set if `spark-submit` goes through the steps to create the driver pod, so it doesn't apply in this case.
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203526021

--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+
+private[r] object PrefixSpanWrapper {
+  def getPrefixSpan(
+      minSupport: Double,
+      maxPatternLength: Int,
+      maxLocalProjDBSize: Double,
--- End diff --

```maxLocalProjDBSize``` is a ```LongParam``` in Scala, but it seems that R doesn't have ```long```. See ```help(integer)```:
```
Note that on almost all implementations of R the range of representable integers is restricted to about +/-2*10^9: ‘double’s can hold much larger integers exactly.
```
so I used ```double``` in R and do ```toLong``` later.
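A minimal sketch of the Double-to-Long round trip described above, assuming the standard `PrefixSpan` setters (the wrapper shape is illustrative, not the final merged code):

```scala
import org.apache.spark.ml.fpm.PrefixSpan

// R has no 64-bit integer type, so the R side sends the value as a double;
// the Scala wrapper narrows it back to Long before setting the LongParam.
def getPrefixSpan(
    minSupport: Double,
    maxPatternLength: Int,
    maxLocalProjDBSize: Double): PrefixSpan = {
  new PrefixSpan()
    .setMinSupport(minSupport)
    .setMaxPatternLength(maxPatternLength)
    .setMaxLocalProjDBSize(maxLocalProjDBSize.toLong)
}
```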
[GitHub] spark issue #21720: [SPARK-24163][SPARK-24164][SQL] Support column list as t...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21720 @gatorsmile @maryannxue Can we move forward with this PR: https://github.com/apache/spark/pull/21699 ?
[GitHub] spark issue #21794: [SPARK-24834][CORE] use java comparison for float and do...
Github user bavardage commented on the issue: https://github.com/apache/spark/pull/21794

It does seem that Spark currently does distinguish -0.0 and 0.0, at least as far as groupBys go:
```
scala> case class Thing(x : Float)
defined class Thing

scala> val df = Seq(Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f)).toDF
2018-07-17 13:17:08 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [x: float]

scala> df.groupBy($"x").count
res0: org.apache.spark.sql.DataFrame = [x: float, count: bigint]

scala> res0.collect
res1: Array[org.apache.spark.sql.Row] = Array([-0.0,4], [0.0,4])
```
Doubles are hashed via `doubleToLongBits` (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L338), which gives different bitwise representations of positive and negative 0.
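The bit-level reason is easy to check directly (a self-contained snippet, not Spark code):

```scala
// -0.0 and 0.0 compare equal numerically but have different bit patterns,
// so a hash built on doubleToLongBits separates them into two groups.
val posBits = java.lang.Double.doubleToLongBits(0.0)   // 0x0000000000000000L
val negBits = java.lang.Double.doubleToLongBits(-0.0)  // 0x8000000000000000L
assert(0.0 == -0.0)          // numeric comparison: equal
assert(posBits != negBits)   // bitwise hash input: different
```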
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

> Label spark-app-id is only set if spark-submit goes through the steps to create the driver pod so doesn't apply in this case.

In that case, the client process could create its own `spark-client-app-id`...
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> In that case, the client process could create its own spark-client-app-id...

Yes, and that's what my point above is about. Regardless of how the driver pod is created and managed, users need to make sure it contains a label that is unique for the pod/service combination to work.
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21533 **[Test build #4220 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4220/testReport)** for PR 21533 at commit [`eb46ccf`](https://github.com/apache/spark/commit/eb46ccfec084c2439a26eee38015381f091fe164).
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/21806 @viirya this current change is only useful when you compare canonicalized plans created on different JVMs. This has come up when we tried to detect changes in plans over spark versions (plan stability) and when people are writing to caches used by multiple clusters.
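A minimal sketch of the shape of the change (not necessarily the exact merged code): keep `jvmId` in the case class for equality, but hash only the JVM-independent `id`.

```scala
import java.util.UUID

case class ExprId(id: Long, jvmId: UUID) {
  // Hash only the counter-assigned id, so two JVMs that canonicalize the
  // same plan produce the same hash; case-class equality still includes jvmId.
  override def hashCode(): Int = id.hashCode()
}
```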
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21748

Taking a step back, I think it seems unwise to make any assumptions about the location in which a driver is running in client mode. Client mode is simply saying that the application driver is running "locally", wherever locally is. That is why I suggested also removing the driver's knowledge of the driver pod name and removing the owner reference concept entirely. As soon as we start to make assumptions in one place, it may (though not necessarily) encourage us to rely on those assumptions to create more forks and complexities in the code in other places. That's a precedent we likely do not want to set.

> Yes, and that's what my point above is about. Regardless of how the driver pod is created and managed, users need to make sure it contains a label that is unique for the pod/service combination to work.

In one case the requirement for the unique label is explicit - the user has created their service and their pod manually and has assigned the labels accordingly. In the other case, the user must know to assign the unique label, but they would only know to do so from having read the documentation on deploying Spark. Explicit requirements that are well defined by the API are preferable to implicit requirements.
[GitHub] spark issue #21794: [SPARK-24834][CORE] use java comparison for float and do...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21794 `spark-sql` suggests that -0 and 0 are considered the same though. `SELECT -0.0 == 0.0;` returns `true`. It's probably essential not to change behavior here, but if performance is the issue, I think the method can be optimized.
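For reference, the tension between the primitive comparison and the Java comparison the PR title refers to can be seen in two lines (plain Scala, not Spark code):

```scala
// Primitive == treats -0.0 and 0.0 as equal; java.lang.Double.compare
// uses a total order in which -0.0 sorts strictly before 0.0.
assert(-0.0 == 0.0)
assert(java.lang.Double.compare(-0.0, 0.0) < 0)
```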
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21748

> That is why I suggested also to remove the driver's knowledge of the driver pod name and to remove the owner reference concept entirely.

Well, not worrying about the driver pod name and owner reference doesn't eliminate the need for a headless service if the driver is indeed running in a pod. So the question is whether the k8s scheduler backend should be responsible for creating the service if the driver is running in a pod. I'm personally leaning towards doing that for the user.
[GitHub] spark issue #21748: [SPARK-23146][K8S] Support client mode.
Github user echarles commented on the issue: https://github.com/apache/spark/pull/21748

> I'm personally leaning towards doing that for the user.

Especially if the user is a data scientist behind a notebook, launching a paragraph which is supposed to instantiate a Spark REPL as fast as possible, without having to call the support center or launch a script. And if a script were needed, we'd better ship the logic in the core of Spark.
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user bomeng commented on the issue: https://github.com/apache/spark/pull/21638 Either way works for me, but since this is not a private method, people may use it in their own code. The minimal change will be the best.
[GitHub] spark issue #21807: [SPARK-24536] Validate that limit clause cannot have a n...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/21807 @NiharS yeah, that makes sense. @mauropalsgraaf we missed this today (sorry about that). Can you add the null check (bonus points if you call `eval()` only once) and add a test for this case?
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21638 Except for `binaryFiles`, everything else that needs to change is private to Spark. I know it's public in the bytecode, but only Java callers could accidentally exploit that. Still I don't personally care too much either way, as long as all the unused args are documented, I guess, for completeness.
[GitHub] spark issue #21582: [SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/21582 Thank you so much, @gatorsmile . I will proceed. Also, thank you, @viirya and @maropu .
[GitHub] spark issue #21202: [SPARK-24129] [K8S] Add option to pass --build-arg's to ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21202 Merged to master
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21451 Merged build finished. Test PASSed.
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21451 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1105/ Test PASSed.
[GitHub] spark pull request #21202: [SPARK-24129] [K8S] Add option to pass --build-ar...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21202
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21451 **[Test build #93250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93250/testReport)** for PR 21451 at commit [`335e26d`](https://github.com/apache/spark/commit/335e26d168dc99e7317175da8732ff691ff512f2).
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21451 @mridulm @jerryshao @felixcheung last one in the 2GB block limit series. Just rebased to include the updates to https://github.com/apache/spark/pull/21440. I will also run my tests on a cluster here with this: https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala and will report the results from that, probably tomorrow. Thanks for all the reviews!
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21546 Thanks @HyukjinKwon! Any additional comments @holdenk @sethah @viirya @felixcheung ?
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18784 Let's remove it in 3.0 then. We can do it after the 2.4 release.
[GitHub] spark issue #21808: [SPARK-21261][DOCS][SQL] SQL Regex document fix
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21808

**[Test build #93241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93241/testReport)** for PR 21808 at commit [`a444e80`](https://github.com/apache/spark/commit/a444e80183c83c3eb8df1a5527f802506c8d3bc0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21808: [SPARK-21261][DOCS][SQL] SQL Regex document fix
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21808 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93241/ Test PASSed.
[GitHub] spark issue #21808: [SPARK-21261][DOCS][SQL] SQL Regex document fix
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21808 Merged build finished. Test PASSed.
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 @MaxGekk We are going in circles. I don't think this is a good api to expose currently - the data is available through multiple other means, as I detailed, and while not a succinct one-liner, it is usable. Not to mention @markhamstra's comment. Unless there is some other compelling reason for introducing this which I have missed, I am -1 on introducing this change.
[GitHub] spark issue #21806: [SPARK-24846][SQL] Made hashCode ExprId independent of j...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21806 @hvanhovell Got it. Thanks for your explanation. LGTM.
[GitHub] spark issue #21761: [SPARK-24771][BUILD]Upgrade Apache AVRO to 1.8.2
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/21761 Hi, @gatorsmile . Could you review this PR? I'm wondering if we can have this for Spark 2.4 before branch-cut.
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203538149

--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+
+private[r] object PrefixSpanWrapper {
+  def getPrefixSpan(
+      minSupport: Double,
+      maxPatternLength: Int,
+      maxLocalProjDBSize: Double,
--- End diff --

Ah, I see. Thanks.
[GitHub] spark pull request #21810: [SPARK-24854][SQL] Gathering all Avro options int...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21810

[SPARK-24854][SQL] Gathering all Avro options into the AvroOptions class

## What changes were proposed in this pull request?

In the PR, I propose to put all `Avro` options in the new class `AvroOptions`, in the same way as for the other datasources `JSON` and `CSV`.

## How was this patch tested?

It was tested by `AvroSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 avro-options

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21810.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21810

commit 9da3a98d4e68697a0f4ab2b21be4b6aebeeb25e3
Author: Maxim Gekk
Date: 2018-07-18T18:50:37Z

    Adding a class for Avro Options

commit 3a76ba2de635fb6f653ee1d814ed0b87bcb12fb6
Author: Maxim Gekk
Date: 2018-07-18T21:37:08Z

    Gathering all options in AvroOptions
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21810 **[Test build #93251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93251/testReport)** for PR 21810 at commit [`3a76ba2`](https://github.com/apache/spark/commit/3a76ba2de635fb6f653ee1d814ed0b87bcb12fb6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21798: [SPARK-24836][SQL] New option for Avro datasource - igno...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21798 Please look at this PR: https://github.com/apache/spark/pull/21810 . It introduces `AvroOptions`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21810 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589
> it's not terribly useful to know, e.g., that there are 5 million cores in the cluster if your Job is running in a scheduler pool that is restricted to using far fewer CPUs via the pool's maxShares
Is `defaultParallelism` more useful in that case?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user ssimeonov commented on the issue: https://github.com/apache/spark/pull/21589
@mridulm your comments make an implicit assumption, which is quite incorrect: that Spark users read the Spark codebase and/or are aware of Spark internals. Please consider this PR in the context of its intended audience, who (a) do not read the source code and (b) hardly look at the API docs. What they read are things like Stack Overflow, the Databricks Guide, blog posts and (quite rarely) the occasional how-to-with-Spark book. The fact that something is possible with Spark doesn't make it easy or intuitive. The value of this PR is that it makes a common use case easy and intuitive. Let's consider the practicality of your suggestions:
> Rely on defaultParallelism - this gives the expected result, unless explicitly overridden by user.
That doesn't address the core use case, as the scope of change & effect is very different. In the targeted use cases, a user wants to explicitly control the level of parallelism relative to the current cluster physical state, potentially for a single stage. Relying on `defaultParallelism` exposes the user to undesired side effects, as the setting can be changed by other, potentially unrelated code the user has no control over. Introducing unintended side effects, which your suggestion does, is poor design.
> If you need fine grained information about executors, use spark listener (it is trivial to keep a count with onExecutorAdded/onExecutorRemoved).
I'd suggest you reconsider your definition of "trivial". Normal Spark users, not people who work on Spark or at companies like Hortonworks whose job is to be Spark experts, have no idea what a listener is, have never hooked one and never will. Not to mention how much fun it is to do this from, say, R.
> If you simply want a current value without own listener - use REST api to query for current executors.
This type of suggestion is a prime example of ignoring Spark user concerns. You are comparing `sc.numExecutors` with:
1. Knowing that a REST API exists that can produce this result.
2. Learning the details of the API.
3. Picking a synchronous REST client in the language they are using Spark with.
4. Initializing the REST client with the correct endpoint, which they obtain... somehow.
5. Formulating the request.
6. Parsing the response.
I don't think there is any need to say more about this suggestion.
Taking a step back, it is important to acknowledge that Spark has become a mass-market data platform product and to start designing user-facing APIs with this in mind. If the teams I know are any indication, the majority of Spark users are not experienced backend/data engineers. They are data scientists and data hackers: people who are getting into big data via Spark. The imbalance is only going to grow. The criteria by which user-focused Spark APIs are evaluated should evolve accordingly. From an ease-of-use perspective, I'd argue the two new methods should also be exposed on `SparkSession`, as this is the typical new-user "entry point". For example, the data scientists on my team never use `SparkContext`, but they do adjust stage parallelism via implicits equivalent to the ones proposed in this PR (to significant benefit in query execution performance).
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
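[Editor's note: for a sense of what the listener suggestion involves in practice, here is a minimal sketch using Spark's public `SparkListener` callbacks. The class name and wiring are illustrative assumptions, not code from the PR or from either comment.]

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Illustrative listener that keeps a live count of registered executors.
class ExecutorCountListener extends SparkListener {
  private val count = new AtomicInteger(0)

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    count.incrementAndGet()

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    count.decrementAndGet()

  def numExecutors: Int = count.get()
}

// Wiring, given an existing SparkContext `sc`:
//   val listener = new ExecutorCountListener
//   sc.addSparkListener(listener)
//   ... later: listener.numExecutors
```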
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21803 **[Test build #93242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93242/testReport)** for PR 21803 at commit [`f302777`](https://github.com/apache/spark/commit/f302777461e258e59e3158ec44cd78a43ae8758c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21803 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21803: [SPARK-24849][SQL] Converting a value of StructType to a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93242/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21810 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93251/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21810 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21810: [SPARK-24854][SQL] Gathering all Avro options into the A...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21810 **[Test build #93251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93251/testReport)** for PR 21810 at commit [`3a76ba2`](https://github.com/apache/spark/commit/3a76ba2de635fb6f653ee1d814ed0b87bcb12fb6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21796: [SPARK-24833][K8S][WIP] Add host name aliases feature
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21796 I think we decided not to take any new configuration options with https://issues.apache.org/jira/browse/SPARK-24434 being worked on. @mccheah @foxish. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21795 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21584: [SPARK-24433][K8S] Initial R Bindings for SparkR on K8s
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/21584
just wanted to briefly chime in. TL;DR: this build will fail until the PRB is running on our ubuntu build nodes. we are currently blocked from testing this stuff w/the current iteration of the PRB on our jenkins for a couple of reasons:
1) the current pull request builder runs only on our old, ancient and crufty centos 6.9 hosts
2) these hosts have old, ancient and crufty installations of R (3.1.1) and associated libraries
3) trying to re-create (2) on our shiny, new ubuntu 16.04 boxen has, so far, been an exercise in futility. :\
4) this is why we currently have a limited number of ubuntu boxes, and also why things like testing for spark/k8s integration require an additional build *on top* of the standard PRB build to make things work.
i'm working now w/some folks at databricks (@weshsiao) to get this under control enough so that we can get things moving to the ubuntu hosts ASAP.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21589
No, defaultParallelism isn't more useful in that case, but that just starts getting to my overall assessment of this JIRA and PR: it smells of defining the problem to align with a preconception of the solution. Exposing the driver's current accounting of the number of cores active in the cluster is not something that we couldn't do or didn't know how to do a long time ago. Rather, it is something that those of us working on the scheduler chose not to do, because of the expectation that putting this in the public API (and thereby implicitly encouraging its use) was likely to produce as many problems as it solves. This was primarily because of two factors: 1) the number of cores and executors is not static; 2) closely tailoring a Job to some expectation of the number of available cores or executors is not obviously a correct thing to encourage in general.
Whether from node failures, dynamic executor allocation, backend scheduler elasticity/preemption, or just other Jobs running under the same SparkContext, the number of cores and executors available to any particular Job when it is created can easily be different from what is available when any of its Stages actually runs. Even if you could get reliable numbers for the cores and executors that will be available through the lifecycle of a Job, tailoring a Job to use all of those cores and executors is only the right thing to do in a subset of Spark use cases. For example, using many more executors than there are DFS partitions holding the data, trying to use all of the cores when there are other Jobs pending, trying to use all of the cores when another Job needs to acquire a minimum number for barrier scheduled execution, or trying to use more cores than a scheduling pool permits would all be examples of anti-patterns that would be enabled by easy, context-free access to low-level numCores.
There definitely are use cases where users need to be able to set policy for whether particular jobs should be encouraged to use more or less of the cluster's resources, but I believe that that needs to be done at a much higher level of abstraction, in a declarative form, and that policy likely needs to be enforced dynamically/adaptively at Stage boundaries. The under-developed and under-used dynamic shuffle partitioning code in Spark SQL starts to go in that direction.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user ssimeonov commented on the issue: https://github.com/apache/spark/pull/21589
@markhamstra the purpose of this PR is not to address the topic of dynamic resource management in arbitrarily complex Spark environments. Most Spark users do not operate in such environments. It is to help simple Spark users refactor code such as
```scala
df.repartition(25) // and related repartition() + coalesce() variants
```
to make job execution take advantage of additional cores, when they are available. Asking for a greater degree of parallelism than the cores a job has available rarely has significant negative effects (for reasonable values). Asking for a low degree of parallelism when there are lots of cores available has significant negative effects, especially in the common real-world use cases where there is lots of data skew. That's the point that both you and @mridulm seem to be missing. The arguments about resources flexing during job execution do not change this. My team has used this simple technique for years on both static and autoscaling clusters, and we've seen meaningful performance improvements in both ETL and ML/AI-related data production for data ranging from gigabytes to petabytes. The idea is simple enough that even data scientists can (and do) easily use it. That's the benefit of this PR and that's why I like it. The cost of this PR is adding two simple & clear methods. The cost-benefit analysis seems obvious.
I agree with you that lots more can be done to handle the general case of better matching job resource needs to cluster/pool resources. This work is going to take forever given the current priorities. Let's not deny the majority of Spark users simple & real execution benefits while we dream about amazing architectural improvements. When looking at the net present value of performance, the discount factor is large. Performance improvements now are a lot more valuable than performance improvements in the far future.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
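[Editor's note: the refactoring described above, replacing a hard-coded count like `repartition(25)` with one derived from cluster size, can be sketched as follows. `repartitionPerCore` is a hypothetical helper; `defaultParallelism`, which exists today, stands in for the methods the PR proposes.]

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: scale a stage's parallelism to the cluster's current
// size rather than a hard-coded constant. On non-local masters,
// defaultParallelism is typically the total core count, so `factor` oversizes
// it to smooth over data skew.
def repartitionPerCore(df: DataFrame, factor: Int = 2): DataFrame =
  df.repartition(factor * df.sparkSession.sparkContext.defaultParallelism)

// Usage, given an existing DataFrame `df`:
//   val balanced = repartitionPerCore(df)
```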
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21589 @ssimeonov the purpose of a public API is not to offer hack solutions to a subset of problems. What is needed is a high-level, declarative abstraction that can be used to specify requested Job resource-usage policy. Exposing low-level Spark scheduling internals is not the way to achieve that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21795 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21804 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1106/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21804: [SPARK-24268][SQL] Use datatype.catalogString in error m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21804 **[Test build #93252 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93252/testReport)** for PR 21804 at commit [`5ffc793`](https://github.com/apache/spark/commit/5ffc793e8fadc0c7c8a1137d227c187f88be6ef5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user ssimeonov commented on the issue: https://github.com/apache/spark/pull/21589
@markhamstra even the words you are using indicate that you are missing the intended audience.
> high-level, declarative abstraction that can be used to specify requested Job resource-usage policy
How exactly do you imagine data scientists using something like this as they hack in a Jupyter or Databricks notebook in Python to sample data from a 10 TB dataset?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org