[GitHub] spark issue #23086: [SPARK-25528][SQL] data source v2 API refactor (batch re...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23086
  
**[Test build #99428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99428/testReport)** for PR 23086 at commit [`38fdac6`](https://github.com/apache/spark/commit/38fdac6677efa4e2d160fd2721d0d65e57bf1f77).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23169
  
**[Test build #99427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99427/testReport)** for PR 23169 at commit [`9678799`](https://github.com/apache/spark/commit/9678799f3b203d667c2f4b27dcedd3591606a8cf).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-28 Thread heary-cao
Github user heary-cao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237344805
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
   logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
 }
   }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted
+   * after `f` returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

I'm not sure whether I need `assert(spark.sparkContext.statusTracker.getExecutorInfos.map(_.numRunningTasks()).sum == 0)`; after all, `protected def spark: SparkSession` is defined in `SQLTestData`. Unless we construct a `waitForTasksToFinish` that looks like:
```
protected def waitForTasksToFinish(): Unit = {
  eventually(timeout(10.seconds)) {
  }
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23169
  
**[Test build #99426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99426/testReport)** for PR 23169 at commit [`45a60fc`](https://github.com/apache/spark/commit/45a60fc7f9f4a0c04eae5ae10be68c5aba3dc3e1).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class WriterSizeException(val extraChars: Long, val charLimit: Long) extends Exception(`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23169
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99426/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23169
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23169
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...

2018-11-28 Thread DaveDeCaprio
Github user DaveDeCaprio commented on a diff in the pull request:

https://github.com/apache/spark/pull/23169#discussion_r237344141
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SizeLimitedWriter.scala ---
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import java.io.Writer
+
+class WriterSizeException(val attemptedSize: Long, val charLimit: Long) extends Exception(
+  s"Attempted to write $attemptedSize characters to a writer that is limited to $charLimit")
+
+/**
+ * This class is used to control the size of generated writers. Guarantees that the total number
+ * of characters written will be less than the specified size.
+ *
+ * Checks size before writing and throws a WriterSizeException if the total size would exceed the
+ * limit.
+ */
+class SizeLimitedWriter(underlying: Writer, charLimit: Long) extends Writer {
+
+  var charsWritten: Long = 0
+
+  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
--- End diff --

I changed to writing out to the full limit
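Below is a minimal, hypothetical sketch of what "writing out to the full limit" could look like: write as many characters as the remaining budget allows before signaling the overflow. It reuses the `WriterSizeException` from the diff above; the exact implementation in the patch may differ.
```
import java.io.Writer

class SizeLimitedWriter(underlying: Writer, charLimit: Long) extends Writer {
  private var charsWritten: Long = 0

  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
    // Write up to the remaining budget, then report the attempted overflow.
    val remaining = charLimit - charsWritten
    val toWrite = math.min(remaining, len.toLong).toInt
    if (toWrite > 0) {
      underlying.write(cbuf, off, toWrite)
      charsWritten += toWrite
    }
    if (toWrite < len) {
      throw new WriterSizeException(charsWritten + (len - toWrite), charLimit)
    }
  }

  override def flush(): Unit = underlying.flush()
  override def close(): Unit = underlying.close()
}
```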


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...

2018-11-28 Thread DaveDeCaprio
Github user DaveDeCaprio commented on a diff in the pull request:

https://github.com/apache/spark/pull/23169#discussion_r237344154
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SizeLimitedWriter.scala ---
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import java.io.Writer
+
+class WriterSizeException(val attemptedSize: Long, val charLimit: Long) extends Exception(
+  s"Attempted to write $attemptedSize characters to a writer that is limited to $charLimit")
+
+/**
+ * This class is used to control the size of generated writers. Guarantees that the total number
+ * of characters written will be less than the specified size.
+ *
+ * Checks size before writing and throws a WriterSizeException if the total size would exceed the
+ * limit.
+ */
+class SizeLimitedWriter(underlying: Writer, charLimit: Long) extends Writer {
+
+  var charsWritten: Long = 0
--- End diff --

fixed


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23169
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5498/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23169
  
**[Test build #99426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99426/testReport)** for PR 23169 at commit [`45a60fc`](https://github.com/apache/spark/commit/45a60fc7f9f4a0c04eae5ae10be68c5aba3dc3e1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...

2018-11-28 Thread DaveDeCaprio
Github user DaveDeCaprio commented on a diff in the pull request:

https://github.com/apache/spark/pull/23169#discussion_r237344026
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala ---
@@ -595,4 +596,14 @@ class TreeNodeSuite extends SparkFunSuite {
 val expected = Coalesce(Stream(Literal(1), Literal(3)))
 assert(result === expected)
   }
+
+  test("toString() tree depth") {
--- End diff --

fixed


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...

2018-11-28 Thread DaveDeCaprio
Github user DaveDeCaprio commented on a diff in the pull request:

https://github.com/apache/spark/pull/23169#discussion_r237343993
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala ---
@@ -202,6 +202,26 @@ package object util extends Logging {
   /** Shorthand for calling truncatedString() without start or end strings. */
   def truncatedString[T](seq: Seq[T], sep: String): String = truncatedString(seq, "", sep, "")
 
+  /** Whether we have warned about plan string truncation yet. */
+  private val planSizeWarningPrinted = new AtomicBoolean(false)
+
+  def withSizeLimitedWriter[T](writer: Writer)(f: (Writer) => T): Option[T] = {
+    try {
+      val limited = new SizeLimitedWriter(writer, SQLConf.get.maxPlanStringLength)
+      Some(f(limited))
+    }
+    catch {
+      case e: WriterSizeException =>
+        writer.write("...")
--- End diff --

It was easy enough to do this by dropping the limit by 3 characters.
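A minimal sketch of that idea, assuming the names from the diff above (`SizeLimitedWriter`, `WriterSizeException`) and a hypothetical `maxLength` parameter: reserve three characters of the budget so the "..." marker still fits within the configured limit.
```
import java.io.Writer

def withSizeLimitedWriter[T](writer: Writer, maxLength: Long)(f: Writer => T): Option[T] = {
  try {
    // Reserve 3 characters of the limit for the truncation marker.
    val limited = new SizeLimitedWriter(writer, maxLength - 3)
    Some(f(limited))
  } catch {
    case _: WriterSizeException =>
      writer.write("...") // the marker fits because the limit was lowered by 3
      None
  }
}
```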


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23086#discussion_r237342899
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java ---
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.sources.v2.reader.Scan;
+import org.apache.spark.sql.sources.v2.reader.ScanBuilder;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * An interface representing a logical structured data set of a data source. For example, the
+ * implementation can be a directory on the file system, a topic of Kafka, or a table in the
+ * catalog, etc.
+ * <p>
+ * This interface can mixin the following interfaces to support different operations:
+ * </p>
+ * <ul>
+ *   <li>{@link SupportsBatchRead}: this table can be read in batch queries.</li>
+ * </ul>
+ */
+@Evolving
+public interface Table {
+
+  /**
+   * A name to identify this table.
+   * <p>
+   * By default this returns the class name of this implementation. Please override it to provide a
+   * meaningful name, like the database and table name from catalog, or the location of files for
+   * this table.
+   * </p>
+   */
+  default String name() {
--- End diff --

Do you think it's better to just ask implementations to override `toString`? cc @rdblue


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237342362
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ---
@@ -257,16 +289,65 @@ class FileStreamSource(
 * equal to `end` and will only request offsets greater than `end` in the future.
 */
  override def commit(end: Offset): Unit = {
-    // No-op for now; FileStreamSource currently garbage-collects files based on timestamp
-    // and the value of the maxFileAge parameter.
+    def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+      val curPath = new Path(entry.path)
--- End diff --

Will address.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237342346
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ---
@@ -100,6 +101,36 @@ class FileStreamSource(
 
   logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = 
$maxFileAgeMs")
 
+  ensureNoOverlapBetweenSourceAndArchivePath()
+
+  private def ensureNoOverlapBetweenSourceAndArchivePath(): Unit = {
+    @tailrec
+    def removeGlob(path: Path): Path = {
+      if (path.getName.contains("*")) {
+        removeGlob(path.getParent)
+      } else {
+        path
+      }
+    }
+
+    sourceOptions.sourceArchiveDir match {
+      case None =>
+      case Some(archiveDir) =>
+        val sourceUri = removeGlob(qualifiedBasePath).toUri
+        val archiveUri = new Path(archiveDir).toUri
+
+        val sourcePath = sourceUri.getPath
+        val archivePath = archiveUri.getPath
--- End diff --

Nice catch. Will address.
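For illustration, a rough sketch of such an overlap check, comparing the normalized URI paths once globs are stripped (the helper name and shape are illustrative, not the final implementation):
```
import org.apache.hadoop.fs.Path

def ensureNoOverlap(sourceDir: Path, archiveDir: Path): Unit = {
  val sourcePath = sourceDir.toUri.getPath.stripSuffix("/") + "/"
  val archivePath = archiveDir.toUri.getPath.stripSuffix("/") + "/"
  // Reject the archive dir if either directory contains the other.
  if (sourcePath.startsWith(archivePath) || archivePath.startsWith(sourcePath)) {
    throw new IllegalArgumentException(
      s"Archive directory $archivePath must not overlap source directory $sourcePath")
  }
}
```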


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237342072
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ---
@@ -257,16 +289,65 @@ class FileStreamSource(
 * equal to `end` and will only request offsets greater than `end` in the future.
 */
  override def commit(end: Offset): Unit = {
-    // No-op for now; FileStreamSource currently garbage-collects files based on timestamp
-    // and the value of the maxFileAge parameter.
+    def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+      val curPath = new Path(entry.path)
+      val curPathUri = curPath.toUri
+
+      val newPath = new Path(baseArchiveDirPath + curPathUri.getPath)
+      try {
+        logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}")
+        if (!fs.exists(newPath.getParent)) {
+          fs.mkdirs(newPath.getParent)
+        }
+
+        logDebug(s"Archiving completed file $curPath to $newPath")
+        fs.rename(curPath, newPath)
+      } catch {
+        case NonFatal(e) =>
+          // Log to error but swallow exception to avoid process being stopped
+          logWarning(s"Fail to move $curPath to $newPath / skip moving file.", e)
+      }
+    }
+
+def remove(entry: FileEntry): Unit = {
+  val curPath = new Path(entry.path)
--- End diff --

Yeah... actually I was somewhat confused about whether I have to escape/unescape the path. Thanks for the suggestion. Will address, and add a new unit test for it.
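For illustration, a hedged sketch of building the archive destination with the Hadoop `Path` API rather than string concatenation, which sidesteps most escaping concerns (the helper name is hypothetical):
```
import org.apache.hadoop.fs.Path

// Map /a/b/dataset.txt under /archived/here to /archived/here/a/b/dataset.txt.
def archiveDestination(sourcePath: Path, baseArchiveDir: Path): Path =
  new Path(baseArchiveDir, sourcePath.toUri.getPath.stripPrefix("/"))
```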


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of `Colum...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23152
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of `Colum...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23152
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5497/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of `Colum...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23152
  
**[Test build #99425 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99425/testReport)**
 for PR 23152 at commit 
[`ca164b6`](https://github.com/apache/spark/commit/ca164b6ca952ba9b72b5578b7868cbd4515789fa).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237341854
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ---
@@ -257,16 +289,65 @@ class FileStreamSource(
 * equal to `end` and will only request offsets greater than `end` in the future.
 */
  override def commit(end: Offset): Unit = {
-    // No-op for now; FileStreamSource currently garbage-collects files based on timestamp
-    // and the value of the maxFileAge parameter.
+    def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+      val curPath = new Path(entry.path)
+      val curPathUri = curPath.toUri
+
+      val newPath = new Path(baseArchiveDirPath + curPathUri.getPath)
+      try {
+        logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}")
+        if (!fs.exists(newPath.getParent)) {
+          fs.mkdirs(newPath.getParent)
+        }
+
+        logDebug(s"Archiving completed file $curPath to $newPath")
+        fs.rename(curPath, newPath)
+      } catch {
+        case NonFatal(e) =>
+          // Log to error but swallow exception to avoid process being stopped
+          logWarning(s"Fail to move $curPath to $newPath / skip moving file.", e)
+      }
+    }
+
+    def remove(entry: FileEntry): Unit = {
+      val curPath = new Path(entry.path)
+      try {
+        logDebug(s"Removing completed file $curPath")
+        fs.delete(curPath, false)
+      } catch {
+        case NonFatal(e) =>
+          // Log to error but swallow exception to avoid process being stopped
+          logWarning(s"Fail to remove $curPath / skip removing file.", e)
+      }
+    }
+
+    val logOffset = FileStreamSourceOffset(end).logOffset
+    metadataLog.get(logOffset) match {
--- End diff --

Ah, I didn't notice that. Thanks for letting me know! Will address.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237341425
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ---
@@ -257,16 +289,65 @@ class FileStreamSource(
 * equal to `end` and will only request offsets greater than `end` in the future.
 */
  override def commit(end: Offset): Unit = {
-    // No-op for now; FileStreamSource currently garbage-collects files based on timestamp
-    // and the value of the maxFileAge parameter.
+    def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+      val curPath = new Path(entry.path)
+      val curPathUri = curPath.toUri
+
+      val newPath = new Path(baseArchiveDirPath + curPathUri.getPath)
+      try {
+        logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}")
+        if (!fs.exists(newPath.getParent)) {
+          fs.mkdirs(newPath.getParent)
+        }
+
+        logDebug(s"Archiving completed file $curPath to $newPath")
+        fs.rename(curPath, newPath)
--- End diff --

Yeah, I guess the patch prevents that case if it works as I expect, but I'm also in favor of defensive programming, and logging would be better for end users. Will address.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237340952
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala ---
@@ -74,6 +76,39 @@ class FileStreamOptions(parameters: CaseInsensitiveMap[String]) extends Logging
*/
   val fileNameOnly: Boolean = withBooleanParameter("fileNameOnly", false)
 
+  /**
+   * The archive directory to move completed files to. The option is only effective when
+   * "cleanSource" is set to "archive".
+   *
+   * Note that the completed file will be moved to this archive directory while respecting
+   * its own path.
+   *
+   * For example, if the path of the source file is "/a/b/dataset.txt" and the path of the
+   * archive directory is "/archived/here", the file will be moved to
+   * "/archived/here/a/b/dataset.txt".
+   */
+  val sourceArchiveDir: Option[String] = parameters.get("sourceArchiveDir")
+
+  /**
+   * Defines how to clean up completed files. Available options are "archive", "delete", "no_op".
+   */
+  val cleanSource: CleanSourceMode.Value = {
+    val modeStrOption = parameters.getOrElse("cleanSource", CleanSourceMode.NO_OP.toString)
--- End diff --

OK will address.
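As a rough sketch of the option handling under discussion, assuming a `CleanSourceMode` enumeration with the three modes named in the doc comment (details may differ in the final patch):
```
import java.util.Locale

object CleanSourceMode extends Enumeration {
  val ARCHIVE, DELETE, NO_OP = Value

  // Parse the "cleanSource" option case-insensitively, defaulting to NO_OP.
  def fromString(mode: Option[String]): Value =
    mode.map(_.toUpperCase(Locale.ROOT)) match {
      case Some("ARCHIVE") => ARCHIVE
      case Some("DELETE") => DELETE
      case Some("NO_OP") | None => NO_OP
      case Some(other) =>
        throw new IllegalArgumentException(s"Invalid mode for clean source option: $other")
    }
}
```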


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237340938
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ---
@@ -100,6 +101,36 @@ class FileStreamSource(
 
   logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = 
$maxFileAgeMs")
 
+  ensureNoOverlapBetweenSourceAndArchivePath()
--- End diff --

Ah yes missed it. Will address.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237340601
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -530,6 +530,12 @@ Here are the details of all the sources in Spark.
 "s3://a/dataset.txt"
 "s3n://a/b/dataset.txt"
 "s3a://a/b/c/dataset.txt"
+cleanSource: option to clean up completed files after processing.
+Available options are "archive", "delete", "no_op". If the option is not provided, the default value is "no_op".
+When "archive" is provided, the additional option sourceArchiveDir must be provided as well. The value of "sourceArchiveDir" must be outside of the source path, to ensure archived files are never included as new source files again.
+Spark will move source files while respecting their own paths. For example, if the path of a source file is "/a/b/dataset.txt" and the path of the archive directory is "/archived/here", the file will be moved to "/archived/here/a/b/dataset.txt".
+NOTE: Both archiving (via moving) and deleting completed files introduce overhead (slow down) in each micro-batch, so you need to understand the cost of each operation in your file system before enabling this option. On the other hand, enabling this option reduces the cost of listing source files, which is considered a heavy operation.
+NOTE 2: The source path should not be used from multiple queries when this option is enabled, because source files will be moved or deleted, and this behavior may impact the other queries.
--- End diff --

Nice catch. I missed the case in which multiple sources in the same query refer to the same file directory. Will address.
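For reference, a hedged sketch of how these options would be set on a streaming read, assuming the option names proposed in this PR ("cleanSource", "sourceArchiveDir"):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clean-source-demo").getOrCreate()

// Archive each completed file under /archived/here, preserving its path.
val lines = spark.readStream
  .format("text")
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/archived/here") // must not overlap the source path
  .load("/a/b/*")
```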


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23169
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23169
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99418/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23169
  
**[Test build #99418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99418/testReport)** for PR 23169 at commit [`3ffdc6a`](https://github.com/apache/spark/commit/3ffdc6a13370be2f7cd03bea0e48f8e5ef62ccca).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22957
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5496/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22957
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22514
  
Yea, let's see if the retest works well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22957
  
**[Test build #99424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99424/testReport)** for PR 22957 at commit [`6c93e70`](https://github.com/apache/spark/commit/6c93e708df203cc5a2f3085340ed61bf5c8d90fe).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22957
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23165#discussion_r237339152
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
@@ -49,6 +50,21 @@ class FileIndexSuite extends SharedSQLContext {
 }
   }
 
+  test("SPARK-26188: don't infer data types of partition columns if user 
specifies schema") {
+withTempDir { dir =>
+  val partitionDirectory = new File(dir, s"a=4d")
+  partitionDirectory.mkdir()
+  val file = new File(partitionDirectory, "text.txt")
+  stringToFile(file, "text")
+  val path = new Path(dir.getCanonicalPath)
+  val schema = StructType(Seq(StructField("a", StringType, false)))
+  val catalog = new InMemoryFileIndex(spark, Seq(path), Map.empty, 
Some(schema))
--- End diff --

`catalog` -> `fileIndex`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23165#discussion_r237339038
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala ---
@@ -126,13 +126,14 @@ abstract class PartitioningAwareFileIndex(
 val caseInsensitiveOptions = CaseInsensitiveMap(parameters)
 val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION)
   .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone)
-val inferredPartitionSpec = PartitioningUtils.parsePartitions(
-  leafDirs,
-  typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
-  basePaths = basePaths,
-  timeZoneId = timeZoneId)
+
 userSpecifiedSchema match {
   case Some(userProvidedSchema) if userProvidedSchema.nonEmpty =>
+val inferredPartitionSpec = PartitioningUtils.parsePartitions(
+  leafDirs,
+  typeInference = false,
--- End diff --

Can you add a comment, so that we don't make the same mistake in the future?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99416/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23132
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23132
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99417/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22979
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237338904
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
   logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
 }
   }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted
+   * after `f` returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

shall we also do `waitForTasksToFinish` in `withCreateTempDir`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22979
  
**[Test build #99416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99416/testReport)** for PR 22979 at commit [`e989b77`](https://github.com/apache/spark/commit/e989b77df5aaced9ffa8212ca2adc1b0d60cfe96).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23132
  
**[Test build #99417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99417/testReport)** for PR 23132 at commit [`1ec56e5`](https://github.com/apache/spark/commit/1ec56e53ec7dfc79d4714f55a1166f3b66e6f03b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23171
  
I'm wondering if this is still useful after we fix the boxing issue in 
`InSet`. We can write a binary hash set for primitive types, like 
`LongToUnsafeRowMap`, which should have better performance.
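For context, a toy sketch of the idea: an open-addressing hash set specialized for primitive `Long` keys, which avoids the per-lookup boxing of a `java.util.HashSet[java.lang.Long]` (fixed capacity, no resizing; purely illustrative and unrelated to Spark's actual `LongToUnsafeRowMap`):
```
class LongHashSet(capacity: Int) {
  private val keys = new Array[Long](capacity)
  private val used = new Array[Boolean](capacity)

  // Linear probing over flat primitive arrays: no boxing, cache friendly.
  private def slot(k: Long): Int = {
    var i = (java.lang.Long.hashCode(k) & 0x7fffffff) % capacity
    while (used(i) && keys(i) != k) i = (i + 1) % capacity
    i
  }

  def add(k: Long): Unit = { val i = slot(k); keys(i) = k; used(i) = true }
  def contains(k: Long): Boolean = used(slot(k))
}
```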


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22514
  
Oh, PySpark UT failures look weird.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/23128
  
@rxin Thanks for the guidance, I'll address these comments in a follow-up PR soon.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r237337683
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+val tableName1 = "spark_orc1"
+val tableName2 = "spark_orc2"
+
+withTempDir { dir =>
+  val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+  withTable(tableName1, tableName2) {
+val dataDir = s"${dir.getCanonicalPath}/dir1/"
+val parentDir = s"${dir.getCanonicalPath}/"
+val wildCardDir = new File(s"${dir}/*").toURI
+someDF1.write.orc(dataDir)
+val parentDirStatement =
+  s"""
+ |CREATE EXTERNAL TABLE $tableName1(
+ |  c1 int,
+ |  c2 int,
+ |  c3 string)
+ |STORED AS orc
+ |LOCATION '${parentDir}'""".stripMargin
+sql(parentDirStatement)
+val parentDirSqlStatement = s"select * from ${tableName1}"
+if (isConvertMetastore) {
+  checkAnswer(sql(parentDirSqlStatement), Nil)
+} else {
+ checkAnswer(sql(parentDirSqlStatement),
+   (1 to 2).map(i => Row(i, i, s"orc$i")))
+}
+
+val wildCardStatement =
+  s"""
+ |CREATE EXTERNAL TABLE $tableName2(
+ |  c1 int,
+ |  c2 int,
+ |  c3 string)
+ |STORED AS orc
+ |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

I have two suggestions.

1. Is this PR aiming at only one-level subdirectories? Could you check the behavior on one-, two-, and three-level subdirectories in Parquet Hive tables first?
2. Since the test case looks general for both Parquet/ORC, please add a test case for Parquet while you are here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...

2018-11-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23132
  
LGTM, does CSV need to do the same?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5495/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22514
  
**[Test build #99423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99423/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5494/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22514
  
retest this please...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99415/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-28 Thread heary-cao
Github user heary-cao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237333755
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
   logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
 }
   }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted
+   * after `f` returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

Currently, `withTempDir` and `withCreateTempDir` are somewhat different.
**withTempDir**
```
protected def withTempDir(f: File => Unit): Unit = {
  val dir = Utils.createTempDir().getCanonicalFile
  try f(dir) finally {
    // wait for all tasks to finish before deleting files
    waitForTasksToFinish()
    Utils.deleteRecursively(dir)
  }
}

protected def waitForTasksToFinish(): Unit = {
  eventually(timeout(10.seconds)) {
    assert(spark.sparkContext.statusTracker
      .getExecutorInfos.map(_.numRunningTasks()).sum == 0)
  }
}
```
**withCreateTempDir**
```
protected def withCreateTempDir(f: File => Unit): Unit = {
  val dir = Utils.createTempDir()
  try f(dir) finally {
    Utils.deleteRecursively(dir)
  }
}
```
thanks.
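If the two were unified per the earlier suggestion, a minimal sketch (assuming `waitForTasksToFinish` is made available wherever `withCreateTempDir` lives; this is not the merged code) could look like:
```
// Sketch: temp-dir helper that also waits for running tasks before cleanup.
protected def withCreateTempDir(f: File => Unit): Unit = {
  val dir = Utils.createTempDir().getCanonicalFile
  try f(dir) finally {
    // Wait for all tasks to finish before deleting files.
    waitForTasksToFinish()
    Utils.deleteRecursively(dir)
  }
}
```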


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22514
  
**[Test build #99415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99415/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237333561
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
--- End diff --

OK sounds good. Let's merge this one first just as a matter of process.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23031
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5493/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23031
  
**[Test build #99422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99422/testReport)** for PR 23031 at commit [`96b2280`](https://github.com/apache/spark/commit/96b228067cd8be3c51a0c36702b493e961bd9702).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23031
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/23100
  
Thanks all! I will create a follow-up to add an alias OneHotEncoderEstimator later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/23031
  
Jenkins, retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-28 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237332601
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
--- End diff --

The doc change will be in both 2.4 and master, but the R-related code will be in master only. I think that's why @felixcheung asked me to open a separate PR to merge the doc change into 2.4.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23158: [SPARK-26186][SPARK-26184][CORE] Last updated time is no...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23158
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99409/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23158: [SPARK-26186][SPARK-26184][CORE] Last updated time is no...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23158
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23158: [SPARK-26186][SPARK-26184][CORE] Last updated time is no...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23158
  
**[Test build #99409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99409/testReport)** for PR 23158 at commit [`a07aaa4`](https://github.com/apache/spark/commit/a07aaa44987d6c5a8281d320a0804d6354935a58).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir functi...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23151
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir functi...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23151
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5492/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir functi...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23151
  
**[Test build #99421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99421/testReport)** for PR 23151 at commit [`d7482a3`](https://github.com/apache/spark/commit/d7482a352dc4e6b0110443b20246facc180fda34).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237330636
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
--- End diff --

Pardon, I'm catching up -- why just commit this doc to 2.4 and not master?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

2018-11-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23100


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...

2018-11-28 Thread huaxingao
Github user huaxingao commented on the issue:

https://github.com/apache/spark/pull/23168
  
@srowen It's not in master yet. The PR is here
https://github.com/apache/spark/pull/23072


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23100
  
I went through the PR again, and it looks right to me. Merged into master. 
Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/23100
  
Yes, this diff just reflects the renaming of OneHotEncoderEstimator to OneHotEncoder. Besides that, it also changes related documents and example code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23052
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99407/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23174
  
**[Test build #99402 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99402/testReport)**
 for PR 23174 at commit 
[`0e36a4b`](https://github.com/apache/spark/commit/0e36a4bb4a5a1ad9abee7e003b7d5f3588cba126).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23174
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23052
  
**[Test build #99407 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99407/testReport)**
 for PR 23052 at commit 
[`083d411`](https://github.com/apache/spark/commit/083d411ec1822986dbb82fbe1896a6c0d846c7d8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23174
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99402/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23052
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...

2018-11-28 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/23168
  
Pardon, was this also already added to master? I don't see it but I could 
be missing something.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23173
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23173
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99405/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23173
  
**[Test build #99405 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99405/testReport)**
 for PR 23173 at commit 
[`bfadbf9`](https://github.com/apache/spark/commit/bfadbf9ae5df8e4a1e845c82dddc91d5082f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23172: [SPARK-25957][followup] Build python docker image in sbt...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23172
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23172: [SPARK-25957][followup] Build python docker image in sbt...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23172
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99406/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23172: [SPARK-25957][followup] Build python docker image in sbt...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23172
  
**[Test build #99406 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99406/testReport)**
 for PR 23172 at commit 
[`b49cfde`](https://github.com/apache/spark/commit/b49cfde2211ee30c67f8ec37b6329c854f798438).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237319928
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -257,16 +289,65 @@ class FileStreamSource(
* equal to `end` and will only request offsets greater than `end` in 
the future.
*/
   override def commit(end: Offset): Unit = {
-// No-op for now; FileStreamSource currently garbage-collects files 
based on timestamp
-// and the value of the maxFileAge parameter.
+def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+  val curPath = new Path(entry.path)
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237314690
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -530,6 +530,12 @@ Here are the details of all the sources in Spark.
 "s3://a/dataset.txt"
 "s3n://a/b/dataset.txt"
 "s3a://a/b/c/dataset.txt"
+cleanSource: option to clean up completed files after 
processing.
+Available options are "archive", "delete", "no_op". If the option 
is not provided, the default value is "no_op".
+When "archive" is provided, the additional option 
sourceArchiveDir must be provided as well. The value of 
"sourceArchiveDir" must be outside of the source path, to ensure archived files 
are never picked up as new source files again.
+Spark will move source files while preserving their own paths. For example, 
if the path of a source file is "/a/b/dataset.txt" and the path of the archive 
directory is "/archived/here", the file will be moved to 
"/archived/here/a/b/dataset.txt".
+NOTE: Both archiving (via moving) and deleting completed files 
introduce overhead (a slowdown) in each micro-batch, so understand the cost of 
each operation in your file system before enabling this 
option. On the other hand, enabling this option reduces the cost of listing 
source files, which is considered a heavy operation.
+NOTE 2: The source path should not be used from multiple queries 
when enabling this option, because source files will be moved or deleted, 
which may impact the other queries.
--- End diff --

NOTE 3: Both delete and move actions are best effort. Failing to delete or 
move files will not fail the streaming query.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keyt...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22911
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237319903
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -257,16 +289,65 @@ class FileStreamSource(
* equal to `end` and will only request offsets greater than `end` in 
the future.
*/
   override def commit(end: Offset): Unit = {
-// No-op for now; FileStreamSource currently garbage-collects files 
based on timestamp
-// and the value of the maxFileAge parameter.
+def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+  val curPath = new Path(entry.path)
+  val curPathUri = curPath.toUri
+
+  val newPath = new Path(baseArchiveDirPath + curPathUri.getPath)
+  try {
+logDebug(s"Creating directory if it doesn't exist 
${newPath.getParent}")
+if (!fs.exists(newPath.getParent)) {
+  fs.mkdirs(newPath.getParent)
+}
+
+logDebug(s"Archiving completed file $curPath to $newPath")
+fs.rename(curPath, newPath)
+  } catch {
+case NonFatal(e) =>
+  // Log to error but swallow exception to avoid process being 
stopped
+  logWarning(s"Fail to move $curPath to $newPath / skip moving 
file.", e)
+  }
+}
+
+def remove(entry: FileEntry): Unit = {
+  val curPath = new Path(entry.path)
--- End diff --

`val curPath = new Path(new URI(entry.path))` to make it escape/unescape 
path properly. `entry.path` was created from `Path.toUri.toString`. Could you 
also add a unit test to test special paths such as `/a/b/a b%.txt`?
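
A standalone sketch (not from this PR) of why the `URI` round-trip matters; 
the literal values in the comments are my assumptions about the encoding:

```
import java.net.URI
import org.apache.hadoop.fs.Path

// entry.path was produced via Path.toUri.toString, so special characters
// arrive percent-encoded:
val recorded = new Path("/a/b/a b%.txt").toUri.toString
// recorded == "/a/b/a%20b%25.txt"

// Treating the encoded string as a raw path keeps the escapes as literal
// characters, i.e. it points at a different file:
new Path(recorded).getName          // "a%20b%25.txt"

// Round-tripping through java.net.URI decodes it back to the original:
new Path(new URI(recorded)).getName // "a b%.txt"
```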


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237319176
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -257,16 +289,65 @@ class FileStreamSource(
* equal to `end` and will only request offsets greater than `end` in 
the future.
*/
   override def commit(end: Offset): Unit = {
-// No-op for now; FileStreamSource currently garbage-collects files 
based on timestamp
-// and the value of the maxFileAge parameter.
+def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+  val curPath = new Path(entry.path)
+  val curPathUri = curPath.toUri
+
+  val newPath = new Path(baseArchiveDirPath + curPathUri.getPath)
+  try {
+logDebug(s"Creating directory if it doesn't exist 
${newPath.getParent}")
+if (!fs.exists(newPath.getParent)) {
+  fs.mkdirs(newPath.getParent)
+}
+
+logDebug(s"Archiving completed file $curPath to $newPath")
+fs.rename(curPath, newPath)
+  } catch {
+case NonFatal(e) =>
+  // Log to error but swallow exception to avoid process being 
stopped
+  logWarning(s"Fail to move $curPath to $newPath / skip moving 
file.", e)
+  }
+}
+
+def remove(entry: FileEntry): Unit = {
+  val curPath = new Path(entry.path)
+  try {
+logDebug(s"Removing completed file $curPath")
+fs.delete(curPath, false)
+  } catch {
+case NonFatal(e) =>
+  // Log to error but swallow exception to avoid process being 
stopped
+  logWarning(s"Fail to remove $curPath / skip removing file.", e)
+  }
+}
+
+val logOffset = FileStreamSourceOffset(end).logOffset
+metadataLog.get(logOffset) match {
--- End diff --

you can use `val files = metadataLog.get(Some(logOffset), 
Some(logOffset)).flatMap(_._2)` to use the underlying cache in 
FileStreamSourceLog.
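
In context, roughly (a sketch assuming the surrounding `commit()` from this 
diff):

```
val logOffset = FileStreamSourceOffset(end).logOffset
// The ranged get is served from FileStreamSourceLog's in-memory cache,
// whereas get(logOffset) may have to re-read a compacted batch from disk.
val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2)
```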


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237200636
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
 ---
@@ -74,6 +76,39 @@ class FileStreamOptions(parameters: 
CaseInsensitiveMap[String]) extends Logging
*/
   val fileNameOnly: Boolean = withBooleanParameter("fileNameOnly", false)
 
+  /**
+   * The archive directory to move completed files. The option will be 
only effective when
+   * "cleanSource" is set to "archive".
+   *
+   * Note that the completed file will be moved to this archive directory
+   * while preserving its own path.
+   *
+   * For example, if the path of source file is "/a/b/dataset.txt", and 
the path of archive
+   * directory is "/archived/here", file will be moved to 
"/archived/here/a/b/dataset.txt".
+   */
+  val sourceArchiveDir: Option[String] = parameters.get("sourceArchiveDir")
+
+  /**
+   * Defines how to clean up completed files. Available options are 
"archive", "delete", "no_op".
+   */
+  val cleanSource: CleanSourceMode.Value = {
+val modeStrOption = parameters.getOrElse("cleanSource", 
CleanSourceMode.NO_OP.toString)
--- End diff --

nit: could you create a method to `CleanSourceMode` to convert a string to 
`CleanSourceMode.Value`?
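
Something along these lines (a sketch; the method name `fromString` is 
assumed, not from the PR):

```
object CleanSourceMode extends Enumeration {
  val ARCHIVE, DELETE, NO_OP = Value

  // Hypothetical helper: resolve a user-supplied option string to a mode,
  // defaulting to NO_OP and failing fast on unknown values.
  def fromString(mode: Option[String]): Value = mode match {
    case None => NO_OP
    case Some(value) =>
      values.find(_.toString.equalsIgnoreCase(value)).getOrElse {
        throw new IllegalArgumentException(
          s"Invalid mode for cleanSource option: $value. " +
            s"Must be one of ${values.mkString(", ")}")
      }
  }
}
```

Then the call site shrinks to 
`val cleanSource = CleanSourceMode.fromString(parameters.get("cleanSource"))`.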


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keyt...

2018-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22911
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99404/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237315173
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -100,6 +101,36 @@ class FileStreamSource(
 
   logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = 
$maxFileAgeMs")
 
+  ensureNoOverlapBetweenSourceAndArchivePath()
--- End diff --

Could you do this check only when CleanSourceMode is `ARCHIVE`?
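
e.g. (a sketch of the guard, field names as in this diff):

```
// Only validate the archive dir layout when archiving is requested.
if (sourceOptions.cleanSource == CleanSourceMode.ARCHIVE) {
  ensureNoOverlapBetweenSourceAndArchivePath()
}
```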


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237315718
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -257,16 +289,65 @@ class FileStreamSource(
* equal to `end` and will only request offsets greater than `end` in 
the future.
*/
   override def commit(end: Offset): Unit = {
-// No-op for now; FileStreamSource currently garbage-collects files 
based on timestamp
-// and the value of the maxFileAge parameter.
+def move(entry: FileEntry, baseArchiveDirPath: String): Unit = {
+  val curPath = new Path(entry.path)
+  val curPathUri = curPath.toUri
+
+  val newPath = new Path(baseArchiveDirPath + curPathUri.getPath)
+  try {
+logDebug(s"Creating directory if it doesn't exist 
${newPath.getParent}")
+if (!fs.exists(newPath.getParent)) {
+  fs.mkdirs(newPath.getParent)
+}
+
+logDebug(s"Archiving completed file $curPath to $newPath")
+fs.rename(curPath, newPath)
--- End diff --

It's better to also check the return value of `rename`. A user may reuse a 
source archive dir and cause path conflicts. We should also log this.
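
Something like (a sketch over the diff above):

```
logDebug(s"Archiving completed file $curPath to $newPath")
// On many filesystems rename returns false instead of throwing, e.g. when
// the destination already exists because the archive dir was reused.
if (!fs.rename(curPath, newPath)) {
  logWarning(s"Fail to move $curPath to $newPath / skip moving file.")
}
```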


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237320515
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -100,6 +101,36 @@ class FileStreamSource(
 
   logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = 
$maxFileAgeMs")
 
+  ensureNoOverlapBetweenSourceAndArchivePath()
+
+  private def ensureNoOverlapBetweenSourceAndArchivePath(): Unit = {
+@tailrec
+def removeGlob(path: Path): Path = {
+  if (path.getName.contains("*")) {
+removeGlob(path.getParent)
+  } else {
+path
+  }
+}
+
+sourceOptions.sourceArchiveDir match {
+  case None =>
+  case Some(archiveDir) =>
+val sourceUri = removeGlob(qualifiedBasePath).toUri
+val archiveUri = new Path(archiveDir).toUri
+
+val sourcePath = sourceUri.getPath
+val archivePath = archiveUri.getPath
--- End diff --

we need to use `fs.makeQualified` to turn all user-provided paths into 
absolute paths, as the user may just pass a relative path.
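
i.e. roughly (a sketch, reusing the names from this diff):

```
// Qualify the user-provided archive dir against the source's FileSystem
// so relative or scheme-less paths resolve consistently before comparing.
val sourcePath = removeGlob(qualifiedBasePath).toUri.getPath
val archivePath = fs.makeQualified(new Path(archiveDir)).toUri.getPath
```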


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...

2018-11-28 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r237314459
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -530,6 +530,12 @@ Here are the details of all the sources in Spark.
 "s3://a/dataset.txt"
 "s3n://a/b/dataset.txt"
 "s3a://a/b/c/dataset.txt"
+cleanSource: option to clean up completed files after 
processing.
+Available options are "archive", "delete", "no_op". If the option 
is not provided, the default value is "no_op".
+When "archive" is provided, the additional option 
sourceArchiveDir must be provided as well. The value of 
"sourceArchiveDir" must be outside of the source path, to ensure archived files 
are never picked up as new source files again.
+Spark will move source files while preserving their own paths. For example, 
if the path of a source file is "/a/b/dataset.txt" and the path of the archive 
directory is "/archived/here", the file will be moved to 
"/archived/here/a/b/dataset.txt".
+NOTE: Both archiving (via moving) and deleting completed files 
introduce overhead (a slowdown) in each micro-batch, so understand the cost of 
each operation in your file system before enabling this 
option. On the other hand, enabling this option reduces the cost of listing 
source files, which is considered a heavy operation.
+NOTE 2: The source path should not be used from multiple queries 
when enabling this option, because source files will be moved or deleted, 
which may impact the other queries.
--- End diff --

NOTE 2: The source path should not be used from multiple **sources or** 
queries when enabling this option, because source files will be moved or 
deleted, which may impact the other **sources and** queries.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keyt...

2018-11-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22911
  
**[Test build #99404 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99404/testReport)**
 for PR 22911 at commit 
[`88f1bb5`](https://github.com/apache/spark/commit/88f1bb516fa6098125fb7f77fd548a1d423cf745).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


