[GitHub] spark pull request #22485: [SPARK-18364][YARN] Expose metrics for YarnShuffl...

2018-09-24 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22485#discussion_r219916796
  
--- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ---
@@ -168,6 +170,15 @@ protected void serviceInit(Configuration conf) throws Exception {
    TransportConf transportConf = new TransportConf("shuffle", new HadoopConfigProvider(conf));
    blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile);

+  // register metrics on the block handler into the Node Manager's metrics system.
+  YarnShuffleServiceMetrics serviceMetrics =
+      new YarnShuffleServiceMetrics(blockHandler.getAllMetrics());
+
+  MetricsSystemImpl metricsSystem = (MetricsSystemImpl) DefaultMetricsSystem.instance();
+  metricsSystem.register(
--- End diff --

Sorry for the delay.

My explanation for using reflection was here: 
https://github.com/palantir/spark/pull/149/files#r107770857 -- basically 
`registerSource` isn't accessible from here.

As to why use `registerSource` vs `register`, I don't think I knew about 
`register` at the time.  Looking at 
https://github.com/apache/hadoop/blame/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/impl/MetricsSystemImpl.java, 
`registerSource` ends up being called both ways.

If you can get the right behavior using the accessible method rather than 
reflection, I'd say we should go with that.
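
For what it's worth, here is a minimal sketch (in Scala for brevity, not the PR's actual code) of what the non-reflective path might look like, assuming the NodeManager's DefaultMetricsSystem is already initialized; the source name and description strings are illustrative only:

```scala
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem

// Sketch only: use the public MetricsSystem#register(name, desc, source) API
// instead of reflecting into MetricsSystemImpl#registerSource.
val metricsSystem = DefaultMetricsSystem.instance()
val serviceMetrics = new YarnShuffleServiceMetrics(blockHandler.getAllMetrics())
metricsSystem.register(
  "sparkShuffleService", "Metrics on the Spark shuffle service", serviceMetrics)
```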

cc @robert3005 @mccheah fysa for potential future merge conflict


---




[GitHub] spark issue #22485: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...

2018-09-20 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/22485
  
Sorry for not following through on getting this into Apache.

FWIW, it's been in the Palantir fork of Spark for over a year: 
https://github.com/palantir/spark/search?q=SPARK-18364


---




[GitHub] spark issue #21334: [minor][SQL]Improve ParseError stop location when offend...

2018-05-15 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/21334
  
Hi @rubenfiszel thanks for the contribution!  Can you please take a glance 
through http://spark.apache.org/contributing.html to see the best way to get 
your change merged into Apache Spark?

I'd suggest you:
- file an issue at https://issues.apache.org/jira/projects/SPARK and put 
that in the PR title
- include a test that verifies the fix is working as you expect

Cheers!


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-24 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/20372
  
Tagging folks who have touched this code recently: @vgankidi @ericl @davies 

This seems to provide a more compact packing in every scenario, which 
should improve execution times.  One risk is that individual partitions are no 
longer always contiguous ranges of files in order, but rather sometimes they 
have a gap.  In the test this is the `(file1, file6)` partition.  If something 
depends on that past behavior, it could now break, though I don't think anything 
should rely on this partition ordering.


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/20372
  
Please fix the scala style checks --

```
Running Scala style checks

Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:459: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:463: File line length exceeds 100 characters
[error] Total time: 14 s, completed Jan 23, 2018 10:44:36 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received return code 1
```

and verify locally with `./dev/lint-scala`


---




[GitHub] spark pull request #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20372#discussion_r163464207
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -445,16 +445,25 @@ case class FileSourceScanExec(
       currentSize = 0
     }
 
-    // Assign files to partitions using "Next Fit Decreasing"
-    splitFiles.foreach { file =>
-      if (currentSize + file.length > maxSplitBytes) {
+    def addFile(file: PartitionedFile): Unit = {
+      currentFiles += file
+      currentSize += file.length + openCostInBytes
+    }
+
+    var frontIndex = 0
+    var backIndex = splitFiles.length - 1
+
+    while (frontIndex <= backIndex) {
+      while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(frontIndex))
+        frontIndex += 1
+      }
+      while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(backIndex))
+        backIndex -= 1
+      }
       closePartition()
-      }
-      // Add the given file to the current partition.
-      currentSize += file.length + openCostInBytes
--- End diff --

saw you added a commit to handle the large non-splittable files case -- can 
you please add a test for that also?

want to make sure this while loop never becomes an infinite loop!


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/20372
  
Jenkins, this is ok to test


---




[GitHub] spark pull request #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20372#discussion_r163419745
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
@@ -142,15 +142,16 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
         SQLConf.FILES_OPEN_COST_IN_BYTES.key -> "1") {
       checkScan(table.select('c1)) { partitions =>
         // Files should be laid out [(file1), (file2, file3), (file4, file5), (file6)]
-        assert(partitions.size == 4, "when checking partitions")
-        assert(partitions(0).files.size == 1, "when checking partition 1")
+        assert(partitions.size == 3, "when checking partitions")
+        assert(partitions(0).files.size == 2, "when checking partition 1")
         assert(partitions(1).files.size == 2, "when checking partition 2")
         assert(partitions(2).files.size == 2, "when checking partition 3")
-        assert(partitions(3).files.size == 1, "when checking partition 4")
 
         // First partition reads (file1)
--- End diff --

comment is stale now


---




[GitHub] spark pull request #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20372#discussion_r163424784
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -445,16 +445,25 @@ case class FileSourceScanExec(
       currentSize = 0
     }
 
-    // Assign files to partitions using "Next Fit Decreasing"
-    splitFiles.foreach { file =>
-      if (currentSize + file.length > maxSplitBytes) {
+    def addFile(file: PartitionedFile): Unit = {
+      currentFiles += file
+      currentSize += file.length + openCostInBytes
+    }
+
+    var frontIndex = 0
+    var backIndex = splitFiles.length - 1
+
+    while (frontIndex <= backIndex) {
+      while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(frontIndex))
+        frontIndex += 1
+      }
+      while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(backIndex))
+        backIndex -= 1
+      }
       closePartition()
-      }
-      // Add the given file to the current partition.
-      currentSize += file.length + openCostInBytes
--- End diff --

should add a test for this too


---




[GitHub] spark pull request #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20372#discussion_r163419675
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
@@ -142,15 +142,16 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
         SQLConf.FILES_OPEN_COST_IN_BYTES.key -> "1") {
       checkScan(table.select('c1)) { partitions =>
         // Files should be laid out [(file1), (file2, file3), (file4, file5), (file6)]
--- End diff --

comment is stale now


---




[GitHub] spark pull request #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20372#discussion_r163424415
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -445,16 +445,25 @@ case class FileSourceScanExec(
       currentSize = 0
     }
 
-    // Assign files to partitions using "Next Fit Decreasing"
-    splitFiles.foreach { file =>
-      if (currentSize + file.length > maxSplitBytes) {
+    def addFile(file: PartitionedFile): Unit = {
+      currentFiles += file
+      currentSize += file.length + openCostInBytes
+    }
+
+    var frontIndex = 0
+    var backIndex = splitFiles.length - 1
+
+    while (frontIndex <= backIndex) {
+      while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(frontIndex))
+        frontIndex += 1
+      }
+      while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(backIndex))
+        backIndex -= 1
+      }
       closePartition()
-      }
-      // Add the given file to the current partition.
-      currentSize += file.length + openCostInBytes
--- End diff --

This handles files that are larger than maxSplitBytes, which I don't think is 
done in the new algorithm.  Need to make sure those don't form an infinite loop.
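
To make the concern concrete, here is a toy, self-contained model of the two-pointer packing (not the PR's code; it works on plain file sizes and ignores openCostInBytes). Without the guard marked below, a single file larger than maxSplitBytes leaves both inner loops unable to make progress and the outer loop never terminates:

```scala
// Toy model of the packing loop, illustrating the oversized-file guard.
def pack(sizes: IndexedSeq[Long], maxSplitBytes: Long): Seq[Seq[Long]] = {
  val partitions = Seq.newBuilder[Seq[Long]]
  var current = Vector.empty[Long]
  var currentSize = 0L
  def add(s: Long): Unit = { current :+= s; currentSize += s }
  def closePartition(): Unit = {
    if (current.nonEmpty) partitions += current
    current = Vector.empty
    currentSize = 0L
  }
  var frontIndex = 0
  var backIndex = sizes.length - 1
  while (frontIndex <= backIndex) {
    while (frontIndex <= backIndex && currentSize + sizes(frontIndex) <= maxSplitBytes) {
      add(sizes(frontIndex)); frontIndex += 1
    }
    while (backIndex > frontIndex && currentSize + sizes(backIndex) <= maxSplitBytes) {
      add(sizes(backIndex)); backIndex -= 1
    }
    if (current.isEmpty && frontIndex <= backIndex) {
      // Guard for non-splittable files larger than maxSplitBytes: give the file
      // its own partition; otherwise this iteration adds nothing, closePartition()
      // closes an empty partition, and the outer loop spins forever.
      add(sizes(frontIndex)); frontIndex += 1
    }
    closePartition()
  }
  partitions.result()
}

// e.g. pack(Vector(5L, 200L, 7L), maxSplitBytes = 100L)
// yields two partitions, (5, 7) and (200), rather than looping forever.
```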


---




[GitHub] spark pull request #19917: Add failing test for select with a splatted strea...

2017-12-06 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/19917

Add failing test for select with a splatted stream

## What changes were proposed in this pull request?

Add additional test.

## How was this patch tested?

Additional test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark aash/PDS-59284

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19917.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19917


commit 2535b214ea5349209aecaa4b93868df548077c31
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-12-07T00:50:27Z

Add test for PDS-59284




---




[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-11-29 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19257
  
@cloud-fan @gatorsmile any more changes needed on this PR before merging?  
I don't see any un-addressed comments left.


---




[GitHub] spark issue #19829: [WIP]Upgrade Netty to 4.1.17

2017-11-27 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19829
  
Looks like a fix for https://issues.apache.org/jira/browse/SPARK-19552 -- 
should that be reopened now that Netty is deprecating 4.0.x, so we can't do it 
"Later" anymore?


---




[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

2017-11-16 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19702#discussion_r151569727
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala ---
@@ -372,23 +381,18 @@ private[parquet] class ParquetSchemaConverter(
       // `TIMESTAMP_MICROS` which are both logical types annotating `INT64`.
       //
       // Originally, Spark SQL uses the same nanosecond timestamp type as Impala and Hive.  Starting
-      // from Spark 1.5.0, we resort to a timestamp type with 100 ns precision so that we can store
-      // a timestamp into a `Long`.  This design decision is subject to change though, for example,
-      // we may resort to microsecond precision in the future.
-      //
-      // For Parquet, we plan to write all `TimestampType` value as `TIMESTAMP_MICROS`, but it's
-      // currently not implemented yet because parquet-mr 1.8.1 (the version we're currently using)
-      // hasn't implemented `TIMESTAMP_MICROS` yet, however it supports TIMESTAMP_MILLIS. We will
-      // encode timestamp values as TIMESTAMP_MILLIS annotating INT64 if
-      // 'spark.sql.parquet.int64AsTimestampMillis' is set.
-      //
-      // TODO Converts `TIMESTAMP_MICROS` once parquet-mr implements that.
-
-      case TimestampType if writeTimestampInMillis =>
-        Types.primitive(INT64, repetition).as(TIMESTAMP_MILLIS).named(field.name)
-
+      // from Spark 1.5.0, we resort to a timestamp type with microsecond precision so that we can
--- End diff --

I think this comment change loses some historical context about behavior 
around 1.5.0.  From 1.5.0 through 2.2.0, timestamps had 100 ns precision, and 
now with this change, starting in 2.3.0, they have microsecond precision, right?

The change as written seems to retroactively rewrite the described behavior, 
unless the original comment was wrong.


---




[GitHub] spark pull request #19708: [SPARK-22479][SQL] Exclude credentials from Savei...

2017-11-14 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19708#discussion_r151010772
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommandSuite.scala ---
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.test.SharedSQLContext
+
+class SaveIntoDataSourceCommandSuite extends SharedSQLContext {
+
+  override protected def sparkConf: SparkConf = super.sparkConf
+  .set("spark.redaction.string.regex", "(?i)password|url")
+
+  test("treeString is redacted") {
--- End diff --

Old test name?  We're not modifying the treeString anymore; it's just the 
`SaveIntoDataSourceCommand`.


---




[GitHub] spark issue #19708: [SPARK-22479][SQL] Exclude credentials from SaveintoData...

2017-11-09 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19708
  
Jenkins, this is ok to test


---




[GitHub] spark pull request #19694: [SPARK-22470][DOC][SQL] functions.hash is also us...

2017-11-08 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/19694

[SPARK-22470][DOC][SQL] functions.hash is also used internally for shuffle 
and bucketing

## What changes were proposed in this pull request?

Add clarifying documentation to the scaladoc of `functions.hash`

## How was this patch tested?

Existing tests, though this is a docs-only change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark patch-4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19694.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19694


commit 94278a7f76e5383228c7d974825abb80f4f0f8b9
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-11-08T10:56:31Z

[SPARK-22470][DOC][SQL] functions.hash is also used internally for shuffle 
and bucketing




---




[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...

2017-10-25 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19468#discussion_r147021138
  
--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.scheduler.cluster.k8s
+
+import scala.collection.JavaConverters._
+
+import io.fabric8.kubernetes.api.model._
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.deploy.k8s.ConfigurationUtils
+import org.apache.spark.deploy.k8s.config._
+import org.apache.spark.deploy.k8s.constants._
+import org.apache.spark.util.Utils
+
+/**
+ * Configures executor pods. Construct one of these with a SparkConf to set up properties that are
+ * common across all executors. Then, pass in dynamic parameters into createExecutorPod.
+ */
+private[spark] trait ExecutorPodFactory {
+  def createExecutorPod(
+  executorId: String,
+  applicationId: String,
+  driverUrl: String,
+  executorEnvs: Seq[(String, String)],
+  driverPod: Pod,
+  nodeToLocalTaskCount: Map[String, Int]): Pod
+}
+
+private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf)
+  extends ExecutorPodFactory {
+
+  import ExecutorPodFactoryImpl._
+
+  private val executorExtraClasspath = sparkConf.get(
+org.apache.spark.internal.config.EXECUTOR_CLASS_PATH)
+  private val executorJarsDownloadDir = sparkConf.get(INIT_CONTAINER_JARS_DOWNLOAD_LOCATION)
+
+  private val executorLabels = ConfigurationUtils.parsePrefixedKeyValuePairs(
+    sparkConf,
+    KUBERNETES_EXECUTOR_LABEL_PREFIX,
+    "executor label")
+  require(
+    !executorLabels.contains(SPARK_APP_ID_LABEL),
+    s"Custom executor labels cannot contain $SPARK_APP_ID_LABEL as it is reserved for Spark.")
+  require(
+    !executorLabels.contains(SPARK_EXECUTOR_ID_LABEL),
+    s"Custom executor labels cannot contain $SPARK_EXECUTOR_ID_LABEL as it is reserved for" +
+      s" Spark.")
+
+  private val executorAnnotations =
+    ConfigurationUtils.parsePrefixedKeyValuePairs(
+      sparkConf,
+      KUBERNETES_EXECUTOR_ANNOTATION_PREFIX,
+      "executor annotation")
+  private val nodeSelector =
+    ConfigurationUtils.parsePrefixedKeyValuePairs(
+      sparkConf,
+      KUBERNETES_NODE_SELECTOR_PREFIX,
+      "node selector")
+
+  private val executorDockerImage = sparkConf.get(EXECUTOR_DOCKER_IMAGE)
+  private val dockerImagePullPolicy = sparkConf.get(DOCKER_IMAGE_PULL_POLICY)
+  private val executorPort = sparkConf.getInt("spark.executor.port", DEFAULT_STATIC_PORT)
+  private val blockmanagerPort = sparkConf
+    .getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT)
+  private val kubernetesDriverPodName = sparkConf
+    .get(KUBERNETES_DRIVER_POD_NAME)
+    .getOrElse(throw new SparkException("Must specify the driver pod name"))
+
+  private val executorPodNamePrefix = sparkConf.get(KUBERNETES_EXECUTOR_POD_NAME_PREFIX)
+
+  private val executorMemoryMiB = sparkConf.get(org.apache.spark.internal.config.EXECUTOR_MEMORY)
+  private val executorMemoryString = sparkConf.get(
+    org.apache.spark.internal.config.EXECUTOR_MEMORY.key,
+    org.apache.spark.internal.config.EXECUTOR_MEMORY.defaultValueString)
+
+  private val memoryOverheadMiB = sparkConf
+    .get(KUBERNETES_EXECUTOR_MEMORY_OVERHEAD)
+    .getOrElse(math.max((MEMORY_OVERHEAD_FACTOR * executorMemoryMiB).toInt,
+      MEMORY_OVERHEAD_MIN_MIB))
+  private val executorMemoryWithOverhead = executorMemoryMiB + memoryOverheadMiB
+
+  private val executorCores = sparkConf.getDouble("spark.executor.cores", 1d)
+  private val executorLimitCores = sparkConf.g

[GitHub] spark pull request #19574: [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint

2017-10-25 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/19574

[SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint

## What changes were proposed in this pull request?

Fix java lint

## How was this patch tested?

Run `./dev/lint-java`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark aash/fix-java-lint

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19574.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19574


commit f393c051ad2fc6b4be63447e3109fafbbea0be40
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-10-25T18:01:38Z

[SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint




---




[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] Fix race condition in LauncherSe...

2017-10-25 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19217
  
https://github.com/apache/spark/pull/19574


---




[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] Fix race condition in LauncherSe...

2017-10-25 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19217
  
```

Running Java style checks

Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[238] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[247] (regexp) RegexpSingleline: No trailing whitespace allowed.
[error] running /home/ubuntu/spark/dev/lint-java ; received return code 1
```


---




[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConnections...

2017-10-24 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19217
  
@nivox can you please update the PR title when you get the chance?


---




[GitHub] spark issue #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConnections...

2017-10-24 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19217
  
How about `[SPARK-21991][LAUNCHER] Fix race condition in 
LauncherServer#acceptConnections` ?


---




[GitHub] spark pull request #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConn...

2017-10-24 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19217#discussion_r146493105
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -232,20 +232,20 @@ public void run() {
         };
         ServerConnection clientConnection = new ServerConnection(client, timeout);
         Thread clientThread = factory.newThread(clientConnection);
-        synchronized (timeout) {
--- End diff --

We're no longer synchronizing on `timeout` here, but I didn't see 
anywhere else synchronizing on it either (including `ServerConnection`).  Given 
that it doesn't escape this method, I'm not sure how multiple threads could ever 
access `timeout` at once, so it makes sense to remove this synchronization.


---




[GitHub] spark pull request #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConn...

2017-10-24 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19217#discussion_r146492057
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -232,20 +232,20 @@ public void run() {
         };
         ServerConnection clientConnection = new ServerConnection(client, timeout);
         Thread clientThread = factory.newThread(clientConnection);
-        synchronized (timeout) {
-          clientThread.start();
-          synchronized (clients) {
-            clients.add(clientConnection);
-          }
-          long timeoutMs = getConnectionTimeout();
-          // 0 is used for testing to avoid issues with clock resolution / thread scheduling,
-          // and force an immediate timeout.
-          if (timeoutMs > 0) {
-            timeoutTimer.schedule(timeout, getConnectionTimeout());
-          } else {
-            timeout.run();
-          }
+        synchronized (clients) {
+          clients.add(clientConnection);
--- End diff --

we are now adding to `clients` before starting the `clientThread` instead 
of after.  What's the expected ordering for these two operations?


---




[GitHub] spark pull request #19217: [SPARK-21991][LAUNCHER] LauncherServer acceptConn...

2017-10-24 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19217#discussion_r146493161
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -232,20 +232,20 @@ public void run() {
         };
         ServerConnection clientConnection = new ServerConnection(client, timeout);
         Thread clientThread = factory.newThread(clientConnection);
-        synchronized (timeout) {
-          clientThread.start();
-          synchronized (clients) {
-            clients.add(clientConnection);
-          }
-          long timeoutMs = getConnectionTimeout();
-          // 0 is used for testing to avoid issues with clock resolution / thread scheduling,
-          // and force an immediate timeout.
-          if (timeoutMs > 0) {
-            timeoutTimer.schedule(timeout, getConnectionTimeout());
-          } else {
-            timeout.run();
-          }
+        synchronized (clients) {
+          clients.add(clientConnection);
+        }
+
+        long timeoutMs = getConnectionTimeout();
+        // 0 is used for testing to avoid issues with clock resolution / thread scheduling,
+        // and force an immediate timeout.
+        if (timeoutMs > 0) {
+          timeoutTimer.schedule(timeout, timeoutMs);
--- End diff --

+1 on not calling `getConnectionTimeout()` multiple times


---




[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19486
  
Updated


---




[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path

2017-10-17 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r145296142
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Command.scala ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.{SparkException, TaskContext}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.sources.v2.writer._
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, query: LogicalPlan)
+  extends RunnableCommand {
--- End diff --

I've also observed this issue where the explain output of commands behaves 
differently than from logical plans, and have a repro at 
https://issues.apache.org/jira/browse/SPARK-22204


---




[GitHub] spark pull request #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-12 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/19486

[SPARK-22268][BUILD] Fix lint-java

## What changes were proposed in this pull request?

Fix java style issues

## How was this patch tested?

Run `./dev/lint-java` locally since it's not run on Jenkins

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark aash/fix-lint-java

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19486.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19486


commit 7c321fc7bc9e448ea3b7177089a8e814d901dd26
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-10-12T21:57:19Z

Fix lint-java

./dev/lint-java




---




[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-10-10 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19468
  
Jenkins, ok to test


---




[GitHub] spark issue #19131: [MINOR][SQL]remove unuse import class

2017-09-08 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19131
  
A check for unused imports should be added to scalastyle to prevent these 
from creeping back in.  If this PR were accompanied by that check (failing 
before, now passing), I think the merge conflicts would be worth it to permanently 
fix the problem.


---




[GitHub] spark pull request #19164: [SPARK-21953] Show both memory and disk bytes spi...

2017-09-08 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/19164

[SPARK-21953] Show both memory and disk bytes spilled if either is present



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark patch-3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19164


commit 7852abeaa22cd15e830004c99e035d37cf2d776c
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-09-08T07:48:54Z

[SPARK-21953] Show both memory and disk bytes spilled if either is present




---




[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2

2017-09-07 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r137469790
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader;
+
+import java.io.Serializable;
+
+/**
+ * A read task returned by a data source reader and is responsible to create the data reader.
+ * The relationship between `ReadTask` and `DataReader` is similar to `Iterable` and `Iterator`.
+ *
+ * Note that, the read task will be serialized and sent to executors, then the data reader will be
+ * created on executors and do the actual reading.
+ */
+public interface ReadTask extends Serializable {
+  /**
+   * The preferred locations for this read task to run faster, but Spark can't guarantee that this
+   * task will always run on these locations. Implementations should make sure that it can
+   * be run on any location.
+   */
+  default String[] preferredLocations() {
--- End diff --

What format are these strings expected to be in?  If Spark will be placing 
this ReadTask onto an executor that is a preferred location, the format will 
need to be a documented part of the API.

Are there levels of preference, or only a binary preference?  I'm thinking node vs 
rack vs datacenter for on-prem clusters, or instance vs AZ vs region for cloud 
clusters.


---




[GitHub] spark pull request #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2

2017-09-07 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19136#discussion_r137471056
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/upward/StatisticsSupport.java ---
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.upward;
+
+/**
+ * A mix in interface for `DataSourceV2Reader`. Users can implement this interface to report
+ * statistics to Spark.
+ */
+public interface StatisticsSupport {
--- End diff --

Some data sources have per-column statistics, like how many bytes a column 
has or its min/max (e.g. things required for CBO).

Should that be a separate interface from this one?
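
Purely as an illustration of the question (this interface does not exist in the PR; the names are made up), a per-column variant might look something like:

```scala
// Hypothetical sketch only -- not part of the proposed API.
case class ColumnStatistics(
  sizeInBytes: Option[Long],
  min: Option[Any],
  max: Option[Any],
  nullCount: Option[Long])

trait ColumnStatisticsSupport {
  // Per-column statistics (e.g. for CBO), keyed by column name.
  def columnStats(column: String): Option[ColumnStatistics]
}
```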


---




[GitHub] spark pull request #19153: SPARK-21941 Stop storing unused attemptId in SQLT...

2017-09-07 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/19153

SPARK-21941 Stop storing unused attemptId in SQLTaskMetrics

## What changes were proposed in this pull request?

In a driver heap dump containing 390,105 instances of SQLTaskMetrics this
would have saved approximately 3.2MB of memory.

Since we're not getting any benefit from storing this unused value, let's
eliminate it until a future PR makes use of it.

## How was this patch tested?

Existing unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark aash/trim-sql-listener

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19153.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19153


commit 4490740b93d3b5d8f1f46e49a608115fe0d37a57
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-09-07T06:05:30Z

SPARK-21941 Stop storing unused attemptId in SQLTaskMetrics

In a driver heap dump containing 390,105 instances of SQLTaskMetrics this
would have saved approximately 3.2MB of memory.

Since we're not getting any benefit from storing this unused value, let's
eliminate it until a future PR makes use of it.




---




[GitHub] spark pull request #19088: [SPARK-21875][BUILD] Fix Java style bugs

2017-08-30 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/19088

[SPARK-21875][BUILD] Fix Java style bugs

## What changes were proposed in this pull request?

Fix Java code style so `./dev/lint-java` succeeds

## How was this patch tested?

Run `./dev/lint-java`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark spark-21875-lint-java

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19088.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19088






---



[GitHub] spark pull request #18996: [MINOR][TYPO] Fix typos: runnning and Excecutors

2017-08-18 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/18996

[MINOR][TYPO] Fix typos: runnning and Excecutors

## What changes were proposed in this pull request?

Fix typos

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18996.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18996


commit cf749df92de7e1485752610a2396a14724431250
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-08-18T20:02:36Z

[MINOR][TYPO] Fix typos: runnning and Excecutors




---



[GitHub] spark pull request #18913: [SPARK-21563][CORE] Fix race condition when seria...

2017-08-11 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18913#discussion_r132721021
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1792,6 +1796,9 @@ class SparkContext(config: SparkConf) extends Logging {
 
   /**
    * Adds a JAR dependency for all tasks to be executed on this `SparkContext` in the future.
+   *
+   * If a jar is added during execution, it will not be available until the next TaskSet starts.
--- End diff --

This additional sentence is probably overkill, since I'm sure there are 
many other subtleties about these methods that aren't spelled out in the 
Javadoc.

I'm happy to remove it and keep these docs simple if that's what a committer 
would prefer before merging.


---



[GitHub] spark pull request #18913: [SPARK-21563][CORE] Fix race condition when seria...

2017-08-10 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/18913

[SPARK-21563][CORE] Fix race condition when serializing TaskDescriptions 
and adding jars

## What changes were proposed in this pull request?

Fix the race condition when serializing TaskDescriptions and adding jars by 
keeping the set of jars and files for a TaskSet constant across the lifetime of 
the TaskSet.  Otherwise TaskDescription serialization can produce an invalid 
serialization when new files/jars are added concurrently as the TaskDescription 
is serialized.

## How was this patch tested?

Additional unit test ensures jars/files contained in the TaskDescription 
remain constant throughout the lifetime of the TaskSet.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark SPARK-21563

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18913.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18913


commit e874fbeeff532c4351ba7318c750cd322c4a24f5
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-08-10T23:32:15Z

Add test

commit b06425f7267e2f0e478000c30b60b963291aacb0
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-08-10T23:40:49Z

Fix the test




---



[GitHub] spark pull request #18877: [SPARK-17742][core] Handle child process exit in ...

2017-08-10 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18877#discussion_r132583520
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -118,14 +116,40 @@ void setChildProc(Process childProc, String loggerName, InputStream logStream) {
     this.childProc = childProc;
     if (logStream != null) {
       this.redirector = new OutputRedirector(logStream, loggerName,
-        SparkLauncher.REDIRECTOR_FACTORY);
+        SparkLauncher.REDIRECTOR_FACTORY, this);
+    } else {
+      // If there is no log redirection, spawn a thread that will wait for the child process
+      // to finish.
+      Thread waiter = SparkLauncher.REDIRECTOR_FACTORY.newThread(this::monitorChild);
+      waiter.setDaemon(true);
+      waiter.start();
     }
   }
 
   void setConnection(LauncherConnection connection) {
     this.connection = connection;
   }
 
+  /**
+   * Callback for when the child process exits. Forcefully put the application in a final state,
+   * overwriting the current final state unless it is already FAILED.
+   */
+  synchronized void childProcessExited() {
+    disconnect();
+
+    int ec;
+    try {
+      ec = childProc.exitValue();
+    } catch (Exception e) {
--- End diff --

might want to log the exception here


---



[GitHub] spark pull request #18877: [SPARK-17742][core] Handle child process exit in ...

2017-08-10 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18877#discussion_r132583553
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -166,4 +185,15 @@ private synchronized void fireEvent(boolean isInfoChanged) {
     }
   }
 
+  private void monitorChild() {
+    while (childProc.isAlive()) {
+      try {
+        childProc.waitFor();
+      } catch (Exception e) {
+        // Try again.
--- End diff --

log exception here?


---



[GitHub] spark issue #18789: SPARK-20433 Bump jackson from 2.6.5 to 2.6.7.1

2017-08-09 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18789
  
@srowen sorry for not picking up on this -- thanks for pushing it over the 
finish line in your PR!


---



[GitHub] spark pull request #18789: Bump jackson from 2.6.5 to 2.6.7.1 (#241)

2017-07-31 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/18789

Bump jackson from 2.6.5 to 2.6.7.1 (#241)

This brings in a security fix for CVE-2017-7525 in the jackson-databind 
library, which Spark uses.

Upstream released a 2.6.7.1 for jackson-databind when publishing this fix, but 
not a corresponding 2.6.7.1 for the rest of Jackson, so those only go up to 
2.6.7.

This requires splitting the version variable in /pom.xml

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark SPARK-20433

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18789.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18789


commit 35e69bd0066d5c8a978f71fde369bf103c8e9a5a
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-07-27T03:58:41Z

Bump jackson from 2.6.5 to 2.6.7.1 (#241)

Would use 2.6.7 everywhere but upstream released a 2.6.7.1 for
jackson-databind but not a corresponding 2.6.7 for the rest of jackson, so
those remain on 2.6.7

This requires splitting the version variable in /pom.xml




---



[GitHub] spark issue #18658: [SPARK-20871][SQL] only log Janino code at debug level

2017-07-17 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18658
  
FYI for future reviewers as well: we've been running an [extremely similar 
patch](https://github.com/palantir/spark/pull/181) to PJ's on our distribution 
of Spark for the past several months with no problems.

Without this change, a code compilation failure often escalates into a heap 
OOM as the too-large source code is replicated in memory through the log 
statement.


---



[GitHub] spark pull request #18658: [SPARK-20871][SQL] only log Janino code at debug ...

2017-07-17 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18658#discussion_r127819158
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -1037,24 +1037,22 @@ object CodeGenerator extends Logging {
       ))
     evaluator.setExtendedClass(classOf[GeneratedClass])
 
-    lazy val formatted = CodeFormatter.format(code)
-
     logDebug({
       // Only add extra debugging info to byte code when we are going to print the source code.
       evaluator.setDebuggingInformation(true, true, false)
-      s"\n$formatted"
+      s"\n${CodeFormatter.format(code)}"
--- End diff --

I'd suggest dropping the string concatenation with `\n` here -- it requires 
an additional copy of the code to be held in memory, and for errors where the 
code is too long this causes unnecessary additional pressure on the heap.

---



[GitHub] spark issue #18658: [SPARK-20871] only log Janino code at debug level

2017-07-17 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18658
  
Jenkins this is ok to test


---



[GitHub] spark issue #18621: [SPARK-21400][SQL] Don't overwrite output committers on ...

2017-07-13 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18621
  
jenkins this is ok to test


---



[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...

2017-07-10 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18581#discussion_r126534985
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMOptions.scala ---
@@ -41,11 +41,15 @@ private[libsvm] class LibSVMOptions(@transient private 
val parameters: CaseInsen
 case o => throw new IllegalArgumentException(s"Invalid value `$o` for 
parameter " +
   s"`$VECTOR_TYPE`. Expected types are `sparse` and `dense`.")
   }
+
+  val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR)
--- End diff --

I've seen datasets that have multi-character delimiters of more than 2 
characters.  Specifically `|~|`

So yes there is a use case, but it's a long-tail one.  I'd be happy to get 
this progress of up to 2 characters in now and work towards 3+ in a future PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18176: [SPARK-20952] Make TaskContext an InheritableTheadLocal

2017-06-27 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18176
  
Jenkins this is ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18406: [SPARK-21195] Automatically register new metrics from so...

2017-06-26 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18406
  
@robert3005 looks like a bunch of tests are failing with 
`java.lang.IllegalArgumentException: A metric named 
local-1498509661743.driver.HiveExternalCatalog.fileCacheHits already exists`
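
For context, a small standalone illustration (hypothetical metric name, not the Spark test harness) of this failure mode: codahale's MetricRegistry refuses a second registration under an existing name, which is the kind of error the CI logs above show when the same metric name gets registered more than once.

```scala
import com.codahale.metrics.{Counter, MetricRegistry}

val registry = new MetricRegistry()
registry.register("HiveExternalCatalog.fileCacheHits", new Counter())

try {
  // Registering a second metric under the same name is rejected outright.
  registry.register("HiveExternalCatalog.fileCacheHits", new Counter())
} catch {
  case e: IllegalArgumentException =>
    println(e.getMessage) // "A metric named HiveExternalCatalog.fileCacheHits already exists"
}
```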


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18427: [SPARK-21219][scheduler] Fix race condition between addi...

2017-06-26 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18427
  
Jenkins this is ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18406: [SPARK-21195] Automatically register new metrics from so...

2017-06-26 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18406
  
Jenkins this is ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18209: [SPARK-20992][Scheduler] Add support for Nomad as a sche...

2017-06-22 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18209
  
@manojlds I'm a part of the Spark-on-k8s team that's currently building k8s 
integration for Spark outside of the Apache Spark repo.  You can follow our 
work at https://github.com/apache-spark-on-k8s/spark

Here's the current outstanding diff on top of `apache/branch-2.1`: 
https://github.com/apache-spark-on-k8s/spark/pull/200 , produced from about 
6 months of work by a varied group drawn from approaching a dozen companies at this 
point.  We aim to eventually bring this into the Apache repo in some form, 
though the timeline for that is not yet fixed.

You can follow from the original design and discussion for the kubernetes 
efforts at https://issues.apache.org/jira/browse/SPARK-18278

One of the major pieces of feedback we received back in November was that, 
if possible, the kubernetes integration should be done as a plugin to the 
Apache Spark repo.  Unfortunately that plugin point does not yet exist in 
Spark.  So we filed https://issues.apache.org/jira/browse/SPARK-19700 to begin 
the process of creating that API but have not yet devoted effort to building 
that plugin point.  Once active development on the k8s integration begins to 
slow down and nears completion (we're starting to see signs of this slowing this 
month) we'll probably shift focus to the plugin point as a means of gaining 
even wider use of the k8s integration.  On what a plugin point could possibly 
look like, we produced a thought experiment at 
https://github.com/palantir/spark/pull/81 which you may find interesting.

As of now, it remains unclear whether the k8s code will be brought in as a 
first-class citizen alongside Mesos and YARN, or whether Spark will gain a 
plugin point and the k8s integration will live on as a separate plugin.

We'd be happy to be a part of any wider efforts to create a plugin point 
for custom resource managers at SPARK-19700.  Is that something the Nomad team 
would be interested in contributing to?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17935: [SPARK-20690][SQL] Subqueries in FROM should have alias ...

2017-06-05 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17935
  
@JoshRosen what was the other type of database you were using?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18176: [SPARK-20952] Make TaskContext an InheritableTheadLocal

2017-06-02 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/18176
  
Jenkins this is ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17680: [SPARK-20364][SQL] Support Parquet predicate pushdown on...

2017-05-09 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17680
  
Are there any comments on this PR or is it ready to be merged?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17680: [SPARK-20364][SQL] Support Parquet predicate pushdown on...

2017-04-28 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17680
  
Any further thoughts on this?  It was quite surprising for one of our users 
so I wanted to make sure it was fixed in a future Apache release


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-20 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112531312
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   // scalastyle:on nonascii
 }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+import testImplicits._
+
+Seq(true, false).foreach { vectorized =>
+  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
vectorized.toString) {
+withTempPath { path =>
+  Seq(Some(1), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() 
== 1)
+}
+
+withTempPath { path =>
+  Seq(Some(1L), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 
1L").count() == 1)
+}
+
+withTempPath { path =>
+  Seq(Some(1.0F), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 
2.0").count() == 1)
+}
+
+withTempPath { path =>
+  Seq(Some(1.0D), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 
1.0D").count() == 1)
+}
+
+withTempPath { path =>
+  Seq(true, 
false).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == 
true").count() == 1)
+}
+
+withTempPath { path =>
+  Seq("apple", 
null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  assert(
+spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS 
NOT NULL").count() == 1)
--- End diff --

thanks for adding the additional test below


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-20 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112529697
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala
 ---
@@ -487,6 +487,20 @@ class FileSourceStrategySuite extends QueryTest with 
SharedSQLContext with Predi
 }
   }
 
+  test("no filter puwhdown for nested field access") {
--- End diff --

nit: pushdown


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-20 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112530989
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   // scalastyle:on nonascii
 }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
--- End diff --

thanks for the change -- I wasn't sure if predicate pushdown worked on 
nested columns, and it looks like the new test confirms it does not after this 
change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-19 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112285677
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   // scalastyle:on nonascii
 }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
--- End diff --

is there another existing test that checks that pushdown for 
`struct.field1` syntax works correctly?  I'm not sure how to reference those 
inner fields in a struct field as I don't use it much personally, but want to 
make sure that's not broken as a result of this change.
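
As a self-contained illustration of the two syntaxes in question (hypothetical column names, not part of the PR), this is roughly how a nested struct field differs from a top-level column whose name merely contains a dot:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dots-example").getOrCreate()

val df = spark.range(1).selectExpr(
  "named_struct('field1', id + 1) AS structCol",  // struct column with inner field1 = 1
  "id + 1 AS `col.dots`")                         // flat column literally named "col.dots"

df.where("structCol.field1 > 0").show()  // nested field access, no backticks needed
df.where("`col.dots` > 0").show()        // dotted top-level column name, needs backticks
```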


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-19 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112285883
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   // scalastyle:on nonascii
 }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+import testImplicits._
+
+Seq(true, false).foreach { vectorized =>
+  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
vectorized.toString) {
+withTempPath { path =>
+  Seq(Some(1), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() 
== 1)
+}
+
+withTempPath { path =>
+  Seq(Some(1L), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 
1L").count() == 1)
+}
+
+withTempPath { path =>
+  Seq(Some(1.0F), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 
2.0").count() == 1)
+}
+
+withTempPath { path =>
+  Seq(Some(1.0D), 
None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 
1.0D").count() == 1)
+}
+
+withTempPath { path =>
+  Seq(true, 
false).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  
assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == 
true").count() == 1)
+}
+
+withTempPath { path =>
+  Seq("apple", 
null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+  assert(
+spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS 
NOT NULL").count() == 1)
--- End diff --

please also do the check for `IS NULL` having 1 row too, so this is 
symmetric
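
For clarity, a minimal standalone sketch of the symmetric assertion being asked for (with simplified temp-path handling rather than the suite's helpers): with one null row and one non-null row written, `IS NULL` should also match exactly one row.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val path = Files.createTempDirectory("col-dots-").resolve("data").toString
Seq("apple", null).toDF("col.dots").write.parquet(path)

assert(spark.read.parquet(path).where("`col.dots` IS NOT NULL").count() == 1)
assert(spark.read.parquet(path).where("`col.dots` IS NULL").count() == 1)
```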


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17667: Failing test for parquet predicate pushdown on co...

2017-04-19 Thread ash211
Github user ash211 closed the pull request at:

https://github.com/apache/spark/pull/17667


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17667: Failing test for parquet predicate pushdown on columns w...

2017-04-19 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17667
  
Agreed, will close for now until there's a fix to go along with the test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17667: Failing test for parquet predicate pushdown on dots with...

2017-04-17 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17667
  
@HyukjinKwon thanks for looking at this!  Please feel free to open a Jira 
so we can begin discussing a fix.  I haven't started working on a patch yet, 
only have the test case at this point.  So if you come up with a solution I'd 
love to see it!

cc @robert3005


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17667: Failing test for parquet predicate pushdown on do...

2017-04-17 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/17667

Failing test for parquet predicate pushdown on dots with columns

// checking against Jenkins to make sure this is still live on master

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark test/parquet-dots-in-column2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17667


commit 51cb89b0587d06a6911e14f5a4381bed0b6da2eb
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-04-18T01:23:05Z

Failing test for parquet predicate pushdown on dots with columns




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17664: Typo fix: distitrbuted -> distributed

2017-04-17 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/17664

Typo fix: distitrbuted -> distributed

## What changes were proposed in this pull request?

Typo fix: distitrbuted -> distributed

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17664.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17664


commit bc152fa1f7ed29ffa904bdf9a5ea0665be7a5bac
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-04-17T23:48:30Z

Typo fix: distitrbuted -> distributed




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...

2017-03-28 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17401
  
@jerryshao ready for re-review


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...

2017-03-28 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17401#discussion_r108562943
  
--- Diff: 
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java
 ---
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.yarn;
+
+import com.codahale.metrics.*;
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsInfo;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+
+import java.util.Map;
+
+/**
+ * Modeled off of YARN's NodeManagerMetrics.
+ */
+public class YarnShuffleServiceMetrics implements MetricsSource {
+
+  private final MetricSet metricSet;
+
+  public YarnShuffleServiceMetrics(MetricSet metricSet) {
+this.metricSet = metricSet;
+  }
+
+  /**
+   * Get metrics from the source
+   *
+   * @param collector to contain the resulting metrics snapshot
+   * @param all   if true, return all metrics even if unchanged.
+   */
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+MetricsRecordBuilder metricsRecordBuilder = 
collector.addRecord("shuffleService");
+
+for (Map.Entry<String, Metric> entry : 
metricSet.getMetrics().entrySet()) {
+  collectMetric(metricsRecordBuilder, entry.getKey(), 
entry.getValue());
+}
+  }
+
+  @VisibleForTesting
+  public static void collectMetric(MetricsRecordBuilder 
metricsRecordBuilder, String name, Metric metric) {
--- End diff --

I use `static` here to make it clear that the method does not need to be 
run in the context of an instance.  This prevents it from accidentally 
accessing instance variables when I don't intend it to
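
A rough Scala analogue of the same rationale (hypothetical names, for illustration only): a method on a companion object, like a Java static method, cannot see instance fields, so the compiler guarantees it depends only on its parameters.

```scala
class MetricsHolder(private var lastSeen: Long = 0L)

object MetricsHolder {
  // Referring to `lastSeen` here would not compile -- exactly the accidental
  // instance access that declaring the Java helper static rules out.
  def describe(name: String, value: Long): String = s"$name=$value"
}
```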


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...

2017-03-24 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17401
  
Ready for further review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17411: logging improvements

2017-03-24 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17411
  
Thanks for the contribution @juanrh !  I'm happy to see contributions no 
matter how small.  For larger changes you would need to file a Jira ticket, but 
this is small enough that it's not necessary.  See 
http://spark.apache.org/contributing.html point 2 in the "Jira" section.

Please do make the title more descriptive though -- rename to something 
like `[DOCS] More detailed logging around YARN executor allocation`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...

2017-03-24 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17401
  
Thanks again for the comments @jerryshao !  I've now added some tests to 
verify that the metrics get converted in the expected way to the collector, and 
camel-cased shuffleService


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...

2017-03-24 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17401#discussion_r107840148
  
--- Diff: 
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java
 ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.yarn;
+
+import com.codahale.metrics.*;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsInfo;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+
+import java.util.Map;
+
+/**
+ * Modeled off of YARN's NodeManagerMetrics.
+ */
+public class YarnShuffleServiceMetrics implements MetricsSource {
+
+  private final MetricSet metricSet;
+
+  public YarnShuffleServiceMetrics(MetricSet metricSet) {
+this.metricSet = metricSet;
+  }
+
+  /**
+   * Get metrics from the source
+   *
+   * @param collector to contain the resulting metrics snapshot
+   * @param all   if true, return all metrics even if unchanged.
+   */
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
--- End diff --

Should be able to -- I'm working on creating one now.  By correctness, I 
think you mostly mean that the values passed through are the same, even though 
the naming schemes are different?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...

2017-03-24 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17401#discussion_r107840109
  
--- Diff: 
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
 ---
@@ -166,6 +170,23 @@ protected void serviceInit(Configuration conf) throws 
Exception {
   TransportConf transportConf = new TransportConf("shuffle", new 
HadoopConfigProvider(conf));
   blockHandler = new ExternalShuffleBlockHandler(transportConf, 
registeredExecutorFile);
 
+  // register metrics on the block handler into the Node Manager's 
metrics system.
+  try {
+YarnShuffleServiceMetrics serviceMetrics = new 
YarnShuffleServiceMetrics(
+  blockHandler.getAllMetrics());
+MetricsSystemImpl metricsSystem = (MetricsSystemImpl) 
DefaultMetricsSystem.instance();
+
+Method registerSourceMethod = 
metricsSystem.getClass().getDeclaredMethod("registerSource",
+String.class, String.class, MetricsSource.class);
+registerSourceMethod.setAccessible(true);
+registerSourceMethod.invoke(metricsSystem, "shuffleservice", 
"Metrics on the Spark " +
--- End diff --

will change shortly


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...

2017-03-23 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17401
  
Thanks for taking a look @jerryshao !  I've reformatted to two-space 
indentation and run `./dev/lint-java` to make sure this code passes the linter.

`src/main/java/org/apache/spark/sql/streaming/GroupStateTimeout.java` still 
has some errors but those are separate from this change


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...

2017-03-23 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/17401

[SPARK-18364][YARN] Expose metrics for YarnShuffleService

Registers the shuffle server's metrics with the Hadoop Node Manager's
DefaultMetricsSystem.

## What changes were proposed in this pull request?

Expose the shuffle service metrics not only on the ExternalShuffleService 
as done in SPARK-16405, but also in the YarnShuffleService.  Because the YARN 
Node Manager operates under

## How was this patch tested?

Manual deploy of the new yarn-shuffle jar into a Node Manager and verifying 
that the metrics appear in the Node Manager-standard location.  This is JMX 
with a query endpoint at `/jmx`.

Resulting metrics look like this:

```
[user@host ~]$ curl -sk -XGET https://`hostname -f`:8042/jmx | jq . | grep 
'shuffleservice' -B 1 -A 18
{
  "name": "Hadoop:service=NodeManager,name=shuffleservice",
  "modelerType": "shuffleservice",
  "tag.Hostname": "",
  "openBlockRequestLatencyMillis_count": 1,
  "openBlockRequestLatencyMillis_rate15": 0.0011080303990206543,
  "openBlockRequestLatencyMillis_rate5": 0.0033057092356765017,
  "openBlockRequestLatencyMillis_rate1": 0.015991117074135343,
  "openBlockRequestLatencyMillis_rateMean": 0.003843993699021382,
  "blockTransferRateBytes_count": 118,
  "blockTransferRateBytes_rate15": 0.1307475870844372,
  "blockTransferRateBytes_rate5": 0.39007368980982715,
  "blockTransferRateBytes_rate1": 1.8869518147479705,
  "blockTransferRateBytes_rateMean": 0.45359183094454836,
  "registeredExecutorsSize": 2,
  "registerExecutorRequestLatencyMillis_count": 2,
  "registerExecutorRequestLatencyMillis_rate15": 0.001697343764758814,
  "registerExecutorRequestLatencyMillis_rate5": 0.002970701813078509,
  "registerExecutorRequestLatencyMillis_rate1": 0.0005857750515146702,
  "registerExecutorRequestLatencyMillis_rateMean": 0.007687995987242345
},
[user@host ~]$
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark spark-18364

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17401.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17401


commit 4caf4d28819d82240f1d61e11cdabd68fafad14a
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-03-23T02:59:38Z

[SPARK-18364][YARN] Expose metrics for YarnShuffleService

Registers the shuffle server's metrics with the Hadoop Node Manager's
DefaultMetricsSystem.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17399: Update functions.scala

2017-03-23 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17399
  
Thanks for contributing to Spark @roxannemoslehi !

I think Sean just means updating the title to something more like `[DOCS] 
Clarify round mode in format_number function`.  It doesn't feel big enough to 
be worth a Jira ticket for such a small docs change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/14615
  
Jenkins, this is ok to test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/14615
  
@robert3005 looks like this has unit test failures on 
`org.apache.spark.sql.hive.orc.OrcSourceSuite.SPARK-19459/SPARK-18220: read 
char/varchar column written by Hive` -- is that a flake?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14615: [SPARK-17029] make toJSON not go through rdd form but op...

2017-03-17 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/14615
  
Jenkins, this is ok to test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16959: [SPARK-19631][CORE] OutputCommitCoordinator should not a...

2017-03-02 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16959
  
Any last changes before merging?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16959: [SPARK-19631][CORE] OutputCommitCoordinator shoul...

2017-02-28 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16959#discussion_r103576994
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala
 ---
@@ -195,6 +195,17 @@ class OutputCommitCoordinatorSuite extends 
SparkFunSuite with BeforeAndAfter {
 sc.runJob(rdd, 
OutputCommitFunctions(tempDir.getAbsolutePath).callCanCommitMultipleTimes _,
0 until rdd.partitions.size)
   }
+
+  test("SPARK-19631: Do not allow failed attempts to be authorized for 
committing") {
+val stage: Int = 1
+val partition: Int = 1
+val failedAttempt: Int = 0
+outputCommitCoordinator.stageStart(stage, maxPartitionId = 1)
+outputCommitCoordinator.taskCompleted(stage, partition, attemptNumber 
= failedAttempt,
+  reason = ExecutorLostFailure("0", exitCausedByApp = true, None))
+assert(!outputCommitCoordinator.canCommit(stage, partition, 
failedAttempt))
--- End diff --

My understanding is that this change doesn't affect any task that's 
mid-commit; it only changes the pre-commit portion of a task attempt's 
lifecycle to prevent failed tasks (=> heartbeat timeouts as reported by the 
DAGScheduler) from receiving authorization to commit.  Is there a new way in 
which two task attempts can be authorized simultaneously that didn't exist before?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16959: [SPARK-19631][CORE] OutputCommitCoordinator should not a...

2017-02-24 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16959
  
@vanzin are you the right person to review this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16959: [SPARK-19631][CORE] OutputCommitCoordinator shoul...

2017-02-17 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16959#discussion_r101762924
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -48,25 +48,28 @@ private[spark] class OutputCommitCoordinator(conf: 
SparkConf, isDriver: Boolean)
   private type StageId = Int
   private type PartitionId = Int
   private type TaskAttemptNumber = Int
+  private case class StageState(authorizedCommitters: 
Array[TaskAttemptNumber],
--- End diff --

can you put a comment here that the index into the `authorizedCommitters` 
array is the partitionId?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16575: [SPARK-19213] DatasourceScanExec uses runtime sparksessi...

2017-01-20 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16575
  
@hvanhovell does that description make sense?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16503: [SPARK-18113] Use ask to replace askWithRetry in canComm...

2017-01-13 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16503
  
Making this idempotent looks great.  I think there's a separate issue with 
this code still not handling poorly-timed preemption, but let's deal with that 
in a separate ticket / PR.

Good work so far @jinxing64 !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16575: [SPARK-19213] DatasourceScanExec uses runtime sparksessi...

2017-01-13 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16575
  
@hvanhovell the need is also explained in the Jira ticket: 
https://issues.apache.org/jira/browse/SPARK-19213

Does that code snippet make sense?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16503: [SPARK-18113] Use ask to replace askWithRetry in canComm...

2017-01-12 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16503
  
@jinxing64 can you please fix the failing Scala style tests?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16503: [SPARK-18113] Use ask to replace askWithRetry in canComm...

2017-01-12 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16503
  
You covered my concerns!

I think this will fix some parts of this problem for sure, though I'm not 
sure it covers every possible case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16503: [SPARK-18113] Use ask to replace askWithRetry in ...

2017-01-12 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16503#discussion_r95660339
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -165,9 +167,14 @@ private[spark] class OutputCommitCoordinator(conf: 
SparkConf, isDriver: Boolean)
 authorizedCommitters(partition) = attemptNumber
 true
   case existingCommitter =>
-logDebug(s"Denying attemptNumber=$attemptNumber to commit for 
stage=$stage, " +
-  s"partition=$partition; existingCommitter = 
$existingCommitter")
-false
+// Coordinator should be idempotent when receiving 
AskPermissionToCommit.
+if (existingCommitter == attemptNumber) {
+  true
--- End diff --

please log a warning here before returning `true` -- reaching this branch 
is likely indicative of network problems

```
logWarning(s"Authorizing duplicate request to commit for " +
  s"attemptNumber=$attemptNumber to commit for stage=$stage, 
partition=$partition; " +
  s"existingCommitter = $existingCommitter. This can indicate 
dropped network traffic.")
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16503: [SPARK-18113] Use ask to replace askWithRetry in ...

2017-01-12 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16503#discussion_r95659921
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala
 ---
@@ -221,6 +227,22 @@ private case class OutputCommitFunctions(tempDirPath: 
String) {
   if (ctx.attemptNumber == 0) failingOutputCommitter else 
successfulOutputCommitter)
   }
 
+  // Receiver should be idempotent for AskPermissionToCommitOutput
+  def callCanCommitMultiTimes(iter: Iterator[Int]): Unit = {
--- End diff --

nit rename to `callCanCommitMultipleTimes`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16503: [SPARK-18113] Use ask to replace askWithRetry in ...

2017-01-12 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16503#discussion_r95659396
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala
 ---
@@ -189,6 +188,13 @@ class OutputCommitCoordinatorSuite extends 
SparkFunSuite with BeforeAndAfter {
 assert(
   !outputCommitCoordinator.canCommit(stage, partition, 
nonAuthorizedCommitter + 3))
   }
+
+  test("Authoried commiter get true if it calls canCommit again.") {
--- End diff --

Reword test description to: `Duplicate calls to canCommit from the 
authorized committer gets idempotent responses`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16558: Fix missing close-parens for In filter's toString

2017-01-11 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/16558

Fix missing close-parens for In filter's toString

Otherwise the open parenthesis isn't closed in query plan descriptions of 
batch scans.

PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
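
For reference, a hedged sketch of the shape of the fix (simplified stand-in, not the actual class in `org.apache.spark.sql.sources`): the values list needs the closing `])` so the string in plan descriptions balances.

```scala
// Simplified stand-in for the data source In filter, shown only to illustrate the toString fix.
case class In(attribute: String, values: Array[Any]) {
  override def toString: String = s"In($attribute, [${values.mkString(",")}])"
}

// In("COL_A", Array(1, 2, 4, 6, 10, 16, 219, 815)).toString
//   ==> In(COL_A, [1,2,4,6,10,16,219,815])
```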

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark patch-9

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16558.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16558


commit a3d737b87eef1ca79a69c4714073edab8ffaadaa
Author: Andrew Ash <and...@andrewash.com>
Date:   2017-01-12T07:53:40Z

Fix missing close-parens for In filter's toString

Otherwise the open parentheses isn't closed in query plan descriptions of 
batch scans.

PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2016-12-15 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16281
  
What are the specific patches to parquet that folks are proposing should be 
included in a parquet 1.8.1-spark1?  Or what would be desired in a 
parquet-released 1.8.2?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16092: [SPARK-18662] Move resource managers to separate directo...

2016-12-01 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16092
  
With a k8s backend on the way I do think it adds a nice organization for 
these 3 clearly grouped modules


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16061: [SPARK-18278] [Scheduler] Support native submission of s...

2016-11-29 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/16061
  
Another external scheduler backend I'm aware of is Two Sigma's backend for [Cook](https://github.com/twosigma/Cook), the scheduler they've built. See [CoarseCookSchedulerBackend.scala](https://github.com/twosigma/spark/blob/2.0.2-cook/core/src/main/scala/org/apache/spark/scheduler/CoarseCookSchedulerBackend.scala)
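
Cook's backend lives in-tree in their fork rather than as a plugin, but for reference, an out-of-tree backend can also hook in through Spark's `ExternalClusterManager` ServiceLoader mechanism. A hedged skeleton, with illustrative names rather than Cook's actual classes:

package org.apache.spark.scheduler

import org.apache.spark.SparkContext

// Illustrative skeleton only. ExternalClusterManager is private[spark], so real
// implementations live under the org.apache.spark package and are discovered
// via a META-INF/services entry.
private[spark] class ExampleCookClusterManager extends ExternalClusterManager {
  override def canCreate(masterURL: String): Boolean = masterURL.startsWith("cook://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend =
    ???  // e.g. a coarse-grained backend that talks to the external scheduler's API

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}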


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15932: [SPARK-18448][CORE] SparkSession should implement java.l...

2016-11-18 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/15932
  
Yep, that's precisely what I was envisioning. Thanks @srowen!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...

2016-11-12 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15835#discussion_r87703118
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -703,6 +705,81 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-17059: Allow FileFormat to specify partition pruning strategy") {
+    withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
+      withTempPath { path =>
+        spark.sparkContext.parallelize(Seq(1, 2, 3), 3)
+          .toDF("x").write.parquet(path.getCanonicalPath)
+
+        val zeroPartitions = spark.read.parquet(path.getCanonicalPath).where("x = 0")
+        assert(zeroPartitions.rdd.partitions.length == 0)
+
+        val onePartition = spark.read.parquet(path.getCanonicalPath).where("x = 1")
+        assert(onePartition.rdd.partitions.length == 1)
+      }
+    }
+  }
+
+  test("Do not filter out parquet file when missing in _metadata file") {
+    withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
+      withTempPath { path =>
+        spark.sparkContext.parallelize(Seq(1, 2, 3), 3)
+          .toDF("x").write.parquet(path.getCanonicalPath)
+        spark.sparkContext.parallelize(Seq(4))
+          .toDF("x").write.mode(SaveMode.Append).parquet(path.getCanonicalPath)
--- End diff --

Do you need to turn off `ParquetOutputFormat.ENABLE_JOB_SUMMARY` to prevent the `_metadata` file from being updated on the second write, or is it normally not updated in `SaveMode.Append`?

I would've expected the second write here to update the `_metadata` file, in which case the test name wouldn't match the behavior. But if the second write _doesn't_ update `_metadata`, then the summary file lists only 3 of the 4 file partitions, matching the test name.
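
For concreteness, a hedged sketch of how the appending write could explicitly opt out of the job summary, which would pin down the `_metadata` behavior regardless of what `SaveMode.Append` does by default. This assumes the same test-suite helpers (`withSQLConf`, `withTempPath`) and `spark` session are in scope, and that nested `withSQLConf` calls restore the outer value on exit:

withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
  withTempPath { path =>
    // first write produces _metadata covering these three files
    spark.sparkContext.parallelize(Seq(1, 2, 3), 3)
      .toDF("x").write.parquet(path.getCanonicalPath)

    // hypothetical: disable the summary for the appending write so the fourth
    // file is guaranteed to be absent from _metadata
    withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "false") {
      spark.sparkContext.parallelize(Seq(4))
        .toDF("x").write.mode(SaveMode.Append).parquet(path.getCanonicalPath)
    }
  }
}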


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15486: Typo: form -> from

2016-10-14 Thread ash211
GitHub user ash211 opened a pull request:

https://github.com/apache/spark/pull/15486

Typo: form -> from

## What changes were proposed in this pull request?

Minor typo fix

## How was this patch tested?

Existing unit tests on Jenkins

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ash211/spark patch-8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15486.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15486


commit be8b0b5f39be4389a28768dfa2756da11f87937f
Author: Andrew Ash <and...@andrewash.com>
Date:   2016-10-14T15:09:26Z

Typo: form -> from




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


