[GitHub] spark issue #23086: [SPARK-25528][SQL] data source v2 API refactor (batch re...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23086 **[Test build #99428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99428/testReport)** for PR 23086 at commit [`38fdac6`](https://github.com/apache/spark/commit/38fdac6677efa4e2d160fd2721d0d65e57bf1f77). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23169 **[Test build #99427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99427/testReport)** for PR 23169 at commit [`9678799`](https://github.com/apache/spark/commit/9678799f3b203d667c2f4b27dcedd3591606a8cf). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...
Github user heary-cao commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237344805

--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
     logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
   }
 }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

I'm not sure if I need `assert(spark.sparkContext.statusTracker.getExecutorInfos.map(_.numRunningTasks()).sum == 0)` here; after all, `protected def spark: SparkSession` is defined in `SQLTestData`, not in `SparkFunSuite`. Unless we construct a `waitForTasksToFinish` that looks like:

```
protected def waitForTasksToFinish(): Unit = {
  eventually(timeout(10.seconds)) {
  }
}
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
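For context on the question above, a minimal sketch of how the two pieces could be combined, assuming the suite had access to a `SparkContext` through an abstract `sc` accessor. That accessor is an assumption made purely for illustration: `SparkFunSuite` does not currently expose one, which is exactly the problem raised in this comment.

```
import java.io.File

import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

import org.apache.spark.SparkContext
import org.apache.spark.util.Utils

// Hypothetical combination of withCreateTempDir and waitForTasksToFinish.
// The abstract `sc` accessor is assumed for illustration only.
trait TempDirWithTaskWait {
  protected def sc: SparkContext

  protected def withCreateTempDir(f: File => Unit): Unit = {
    val dir = Utils.createTempDir()
    try f(dir) finally {
      // Wait for running tasks to finish before deleting the directory.
      eventually(timeout(10.seconds)) {
        assert(sc.statusTracker.getExecutorInfos.map(_.numRunningTasks()).sum == 0)
      }
      Utils.deleteRecursively(dir)
    }
  }
}
```

If no context is available in `SparkFunSuite`, the task-wait step simply cannot live there, which is the trade-off being discussed.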
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23169 **[Test build #99426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99426/testReport)** for PR 23169 at commit [`45a60fc`](https://github.com/apache/spark/commit/45a60fc7f9f4a0c04eae5ae10be68c5aba3dc3e1). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class WriterSizeException(val extraChars: Long, val charLimit: Long) extends Exception(` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23169 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99426/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23169 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23169 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...
Github user DaveDeCaprio commented on a diff in the pull request: https://github.com/apache/spark/pull/23169#discussion_r237344141 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SizeLimitedWriter.scala --- @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import java.io.Writer + +class WriterSizeException(val attemptedSize: Long, val charLimit: Long) extends Exception( + s"Attempted to write $attemptedSize characters to a writer that is limited to $charLimit") + +/** + * This class is used to control the size of generated writers. Guarantees that the total number + * of characters written will be less than the specified size. + * + * Checks size before writing and throws a WriterSizeException if the total size would count the + * limit. + */ +class SizeLimitedWriter(underlying: Writer, charLimit: Long) extends Writer { + + var charsWritten: Long = 0 + + override def write(cbuf: Array[Char], off: Int, len: Int): Unit = { --- End diff -- I changed to writing out to the full limit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...
Github user DaveDeCaprio commented on a diff in the pull request: https://github.com/apache/spark/pull/23169#discussion_r237344154 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SizeLimitedWriter.scala --- @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import java.io.Writer + +class WriterSizeException(val attemptedSize: Long, val charLimit: Long) extends Exception( + s"Attempted to write $attemptedSize characters to a writer that is limited to $charLimit") + +/** + * This class is used to control the size of generated writers. Guarantees that the total number + * of characters written will be less than the specified size. + * + * Checks size before writing and throws a WriterSizeException if the total size would count the + * limit. + */ +class SizeLimitedWriter(underlying: Writer, charLimit: Long) extends Writer { + + var charsWritten: Long = 0 --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23169 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5498/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23169 **[Test build #99426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99426/testReport)** for PR 23169 at commit [`45a60fc`](https://github.com/apache/spark/commit/45a60fc7f9f4a0c04eae5ae10be68c5aba3dc3e1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...
Github user DaveDeCaprio commented on a diff in the pull request: https://github.com/apache/spark/pull/23169#discussion_r237344026 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala --- @@ -595,4 +596,14 @@ class TreeNodeSuite extends SparkFunSuite { val expected = Coalesce(Stream(Literal(1), Literal(3))) assert(result === expected) } + + test("toString() tree depth") { --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23169: [SPARK-26103][SQL] Limit the length of debug stri...
Github user DaveDeCaprio commented on a diff in the pull request: https://github.com/apache/spark/pull/23169#discussion_r237343993 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala --- @@ -202,6 +202,26 @@ package object util extends Logging { /** Shorthand for calling truncatedString() without start or end strings. */ def truncatedString[T](seq: Seq[T], sep: String): String = truncatedString(seq, "", sep, "") + /** Whether we have warned about plan string truncation yet. */ + private val planSizeWarningPrinted = new AtomicBoolean(false) + + def withSizeLimitedWriter[T](writer: Writer)(f: (Writer) => T): Option[T] = { +try { + val limited = new SizeLimitedWriter(writer, SQLConf.get.maxPlanStringLength) + Some(f(limited)) +} +catch { + case e: WriterSizeException => +writer.write("...") --- End diff -- It was easy enough to do this by dropping the limit by 3 characters. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
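To make "dropping the limit by 3 characters" concrete, a rough sketch of the approach could look like the following. `SizeLimitedWriter` and `WriterSizeException` are the classes from the diff quoted earlier in this thread; the `maxLength` parameter stands in for the config lookup (`SQLConf.get.maxPlanStringLength` in the quoted code), so this is illustrative rather than the PR's final code.

```
import java.io.Writer

// Sketch only: the wrapped writer is given maxLength - 3 so that "..." always
// fits within the overall character budget when the limit is hit.
def withSizeLimitedWriter[T](writer: Writer, maxLength: Long)(f: Writer => T): Option[T] = {
  try {
    val limited = new SizeLimitedWriter(writer, maxLength - 3)
    Some(f(limited))
  } catch {
    case _: WriterSizeException =>
      writer.write("...")
      None
  }
}
```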
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r237342899 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java --- @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import org.apache.spark.annotation.Evolving; +import org.apache.spark.sql.sources.v2.reader.Scan; +import org.apache.spark.sql.sources.v2.reader.ScanBuilder; +import org.apache.spark.sql.types.StructType; + +/** + * An interface representing a logical structured data set of a data source. For example, the + * implementation can be a directory on the file system, a topic of Kafka, or a table in the + * catalog, etc. + * + * This interface can mixin the following interfaces to support different operations: + * + * + * {@link SupportsBatchRead}: this table can be read in batch queries. + * + */ +@Evolving +public interface Table { + + /** + * A name to identify this table. + * + * By default this returns the class name of this implementation. Please override it to provide a + * meaningful name, like the database and table name from catalog, or the location of files for + * this table. + * + */ + default String name() { --- End diff -- Do you think it's better to just ask implementations to override `toString`? cc @rdblue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237342362 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) --- End diff -- Will address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237342346 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -100,6 +101,36 @@ class FileStreamSource( logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = $maxFileAgeMs") + ensureNoOverlapBetweenSourceAndArchivePath() + + private def ensureNoOverlapBetweenSourceAndArchivePath(): Unit = { +@tailrec +def removeGlob(path: Path): Path = { + if (path.getName.contains("*")) { +removeGlob(path.getParent) + } else { +path + } +} + +sourceOptions.sourceArchiveDir match { + case None => + case Some(archiveDir) => +val sourceUri = removeGlob(qualifiedBasePath).toUri +val archiveUri = new Path(archiveDir).toUri + +val sourcePath = sourceUri.getPath +val archivePath = archiveUri.getPath --- End diff -- Nice finding. Will address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237342072 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) + val curPathUri = curPath.toUri + + val newPath = new Path(baseArchiveDirPath + curPathUri.getPath) + try { +logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}") +if (!fs.exists(newPath.getParent)) { + fs.mkdirs(newPath.getParent) +} + +logDebug(s"Archiving completed file $curPath to $newPath") +fs.rename(curPath, newPath) + } catch { +case NonFatal(e) => + // Log to error but swallow exception to avoid process being stopped + logWarning(s"Fail to move $curPath to $newPath / skip moving file.", e) + } +} + +def remove(entry: FileEntry): Unit = { + val curPath = new Path(entry.path) --- End diff -- Yeah... actually I was somewhat confused I have to escape/unescape for path. Thanks for suggestion. Will address and add a new unit test for testing it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of `Colum...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23152 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of `Colum...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23152 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5497/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of `Colum...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23152 **[Test build #99425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99425/testReport)** for PR 23152 at commit [`ca164b6`](https://github.com/apache/spark/commit/ca164b6ca952ba9b72b5578b7868cbd4515789fa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237341854 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) + val curPathUri = curPath.toUri + + val newPath = new Path(baseArchiveDirPath + curPathUri.getPath) + try { +logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}") +if (!fs.exists(newPath.getParent)) { + fs.mkdirs(newPath.getParent) +} + +logDebug(s"Archiving completed file $curPath to $newPath") +fs.rename(curPath, newPath) + } catch { +case NonFatal(e) => + // Log to error but swallow exception to avoid process being stopped + logWarning(s"Fail to move $curPath to $newPath / skip moving file.", e) + } +} + +def remove(entry: FileEntry): Unit = { + val curPath = new Path(entry.path) + try { +logDebug(s"Removing completed file $curPath") +fs.delete(curPath, false) + } catch { +case NonFatal(e) => + // Log to error but swallow exception to avoid process being stopped + logWarning(s"Fail to remove $curPath / skip removing file.", e) + } +} + +val logOffset = FileStreamSourceOffset(end).logOffset +metadataLog.get(logOffset) match { --- End diff -- Ah I didn't indicate that. Thanks for letting me know! Will address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237341425 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) + val curPathUri = curPath.toUri + + val newPath = new Path(baseArchiveDirPath + curPathUri.getPath) + try { +logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}") +if (!fs.exists(newPath.getParent)) { + fs.mkdirs(newPath.getParent) +} + +logDebug(s"Archiving completed file $curPath to $newPath") +fs.rename(curPath, newPath) --- End diff -- Yeah, I guess the patch prevents the case if it works like my expectation, but I'm also in favor of defensive programming and logging would be better for end users. Will address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
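For reference, the defensive variant being discussed could look roughly like the sketch below. Hadoop's `FileSystem.rename` signals several failure cases by returning `false` rather than throwing, so checking the result lets the source surface a warning instead of silently losing track of the file. The helper name and the `println` stand-in for `logWarning` are illustrative only.

```
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustration only: check rename's boolean result instead of relying on an
// exception, and warn when the move did not happen.
def archiveFile(fs: FileSystem, curPath: Path, newPath: Path): Unit = {
  if (!fs.exists(newPath.getParent)) {
    fs.mkdirs(newPath.getParent)
  }
  if (!fs.rename(curPath, newPath)) {
    // The real source would use logWarning; println keeps this sketch self-contained.
    println(s"Failed to archive $curPath to $newPath, skipping this file.")
  }
}
```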
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237340952 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala --- @@ -74,6 +76,39 @@ class FileStreamOptions(parameters: CaseInsensitiveMap[String]) extends Logging */ val fileNameOnly: Boolean = withBooleanParameter("fileNameOnly", false) + /** + * The archive directory to move completed files. The option will be only effective when + * "cleanSource" is set to "archive". + * + * Note that the completed file will be moved to this archive directory with respecting to + * its own path. + * + * For example, if the path of source file is "/a/b/dataset.txt", and the path of archive + * directory is "/archived/here", file will be moved to "/archived/here/a/b/dataset.txt". + */ + val sourceArchiveDir: Option[String] = parameters.get("sourceArchiveDir") + + /** + * Defines how to clean up completed files. Available options are "archive", "delete", "no_op". + */ + val cleanSource: CleanSourceMode.Value = { +val modeStrOption = parameters.getOrElse("cleanSource", CleanSourceMode.NO_OP.toString) --- End diff -- OK will address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237340938 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -100,6 +101,36 @@ class FileStreamSource( logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = $maxFileAgeMs") + ensureNoOverlapBetweenSourceAndArchivePath() --- End diff -- Ah yes missed it. Will address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237340601 --- Diff: docs/structured-streaming-programming-guide.md --- @@ -530,6 +530,12 @@ Here are the details of all the sources in Spark. "s3://a/dataset.txt" "s3n://a/b/dataset.txt" "s3a://a/b/c/dataset.txt" +cleanSource: option to clean up completed files after processing. +Available options are "archive", "delete", "no_op". If the option is not provided, the default value is "no_op". +When "archive" is provided, additional option sourceArchiveDir must be provided as well. The value of "sourceArchiveDir" must be outside of source path, to ensure archived files are never included to new source files again. +Spark will move source files respecting its own path. For example, if the path of source file is "/a/b/dataset.txt" and the path of archive directory is "/archived/here", file will be moved to "/archived/here/a/b/dataset.txt" +NOTE: Both archiving (via moving) or deleting completed files would introduce overhead (slow down) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enbling this option would reduce the cost to list source files which is considered as a heavy operation. +NOTE 2: The source path should not be used from multiple queries when enabling this option, because source files will be moved or deleted which behavior may impact the other queries. --- End diff -- Nice finding. I missed the case where multiple sources in the same query refer to the same directory. Will address. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
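For readers of the quoted documentation, the intended usage would look roughly like this. The option names ("cleanSource", "sourceArchiveDir") are taken from the quoted doc diff; since the PR is still under review, treat this as a sketch of the intended behavior rather than a final API.

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clean-source-example").getOrCreate()

// Completed source files are moved under /archived/here, preserving their
// original paths, after each micro-batch commits.
val lines = spark.readStream
  .format("text")
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/archived/here")
  .load("/a/b/*")
```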
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23169 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23169 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99418/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23169 **[Test build #99418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99418/testReport)** for PR 23169 at commit [`3ffdc6a`](https://github.com/apache/spark/commit/3ffdc6a13370be2f7cd03bea0e48f8e5ef62ccca). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22957 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5496/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22957 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22514 Yeah, let's see if the retest works well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22957 **[Test build #99424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99424/testReport)** for PR 22957 at commit [`6c93e70`](https://github.com/apache/spark/commit/6c93e708df203cc5a2f3085340ed61bf5c8d90fe). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22957 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23165#discussion_r237339152 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala --- @@ -49,6 +50,21 @@ class FileIndexSuite extends SharedSQLContext { } } + test("SPARK-26188: don't infer data types of partition columns if user specifies schema") { +withTempDir { dir => + val partitionDirectory = new File(dir, s"a=4d") + partitionDirectory.mkdir() + val file = new File(partitionDirectory, "text.txt") + stringToFile(file, "text") + val path = new Path(dir.getCanonicalPath) + val schema = StructType(Seq(StructField("a", StringType, false))) + val catalog = new InMemoryFileIndex(spark, Seq(path), Map.empty, Some(schema)) --- End diff -- `catalog` ->`fileIndex` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23165#discussion_r237339038 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala --- @@ -126,13 +126,14 @@ abstract class PartitioningAwareFileIndex( val caseInsensitiveOptions = CaseInsensitiveMap(parameters) val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION) .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone) -val inferredPartitionSpec = PartitioningUtils.parsePartitions( - leafDirs, - typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled, - basePaths = basePaths, - timeZoneId = timeZoneId) + userSpecifiedSchema match { case Some(userProvidedSchema) if userProvidedSchema.nonEmpty => +val inferredPartitionSpec = PartitioningUtils.parsePartitions( + leafDirs, + typeInference = false, --- End diff -- can you add some comment, so that we don't make the same mistake in the future? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
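The kind of inline comment being requested could read something like the snippet below. The surrounding call is the one from the quoted diff; only the comment wording is a suggestion.

```
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  // SPARK-26188: when the user supplies a schema, never infer partition
  // column types here; otherwise a directory such as "a=4d" would be
  // inferred as a number and conflict with a user-declared StringType column.
  typeInference = false,
  basePaths = basePaths,
  timeZoneId = timeZoneId)
```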
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22979 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99416/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23132 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23132 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99417/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22979 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237338904 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -105,5 +105,16 @@ abstract class SparkFunSuite logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n") } } - + /** + * Creates a temporary directory, which is then passed to `f` and will be deleted after `f` + * returns. + * + * @todo Probably this method should be moved to a more general place + */ + protected def withCreateTempDir(f: File => Unit): Unit = { --- End diff -- shall we also do `waitForTasksToFinish` in `withCreateTempDir`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22979 **[Test build #99416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99416/testReport)** for PR 22979 at commit [`e989b77`](https://github.com/apache/spark/commit/e989b77df5aaced9ffa8212ca2adc1b0d60cfe96). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23132 **[Test build #99417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99417/testReport)** for PR 23132 at commit [`1ec56e5`](https://github.com/apache/spark/commit/1ec56e53ec7dfc79d4714f55a1166f3b66e6f03b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23171 I'm wondering if this is still useful after we fix the boxing issue in `InSet`. We can write a binary hash set for primitive types, like `LongToUnsafeRowMap`, which should have better performance. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
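The boxing point above is that `InSet` backed by a `Set[Any]` boxes each primitive probe value on every lookup; a primitive-specialized structure avoids that. Below is a deliberately minimal illustration using a sorted `Array[Long]` with binary search — a real fix would more likely be a hash-based structure like the `LongToUnsafeRowMap` mentioned above.

```
// Minimal illustration (not Spark code): the probe value stays an unboxed
// Long, unlike Set[Any].contains, which boxes it into java.lang.Long on
// every call.
class LongSortedSet(values: Array[Long]) {
  private val sorted: Array[Long] = values.clone().sorted

  def contains(v: Long): Boolean =
    java.util.Arrays.binarySearch(sorted, v) >= 0
}

val smallIntSet = new LongSortedSet(Array(1L, 5L, 42L))
assert(smallIntSet.contains(42L))
assert(!smallIntSet.contains(7L))
```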
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22514 Oh, PySpark UT failures look weird. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/23128 @rxin Thanks for the guidance, I'll address these comments in a follow-up PR soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/23108#discussion_r237337683 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala --- @@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll { } } + protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = { +val tableName1 = "spark_orc1" +val tableName2 = "spark_orc2" + +withTempDir { dir => + val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1) + withTable(tableName1, tableName2) { +val dataDir = s"${dir.getCanonicalPath}/dir1/" +val parentDir = s"${dir.getCanonicalPath}/" +val wildCardDir = new File(s"${dir}/*").toURI +someDF1.write.orc(dataDir) +val parentDirStatement = + s""" + |CREATE EXTERNAL TABLE $tableName1( + | c1 int, + | c2 int, + | c3 string) + |STORED AS orc + |LOCATION '${parentDir}'""".stripMargin +sql(parentDirStatement) +val parentDirSqlStatement = s"select * from ${tableName1}" +if (isConvertMetastore) { + checkAnswer(sql(parentDirSqlStatement), Nil) +} else { + checkAnswer(sql(parentDirSqlStatement), + (1 to 2).map(i => Row(i, i, s"orc$i"))) +} + +val wildCardStatement = + s""" + |CREATE EXTERNAL TABLE $tableName2( + | c1 int, + | c2 int, + | c3 string) + |STORED AS orc + |LOCATION '$wildCardDir'""".stripMargin --- End diff -- I have two suggestions. 1. Is this PR aiming only one-level subdirectories? Could you check the behavior on one, two, three level subdirectories in Parquet Hive tables first? 2. Since the test case looks general for both Parquet/ORC, please add a test case for Parquet while you are here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23132 LGTM, does CSV need to do the same? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5495/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22514 **[Test build #99423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99423/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5494/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22514 retest this please... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99415/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...
Github user heary-cao commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237333755

--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
     logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
   }
 }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

Currently, `withTempDir` and `withCreateTempDir` are somewhat different.

**withTempDir**
```
protected def withTempDir(f: File => Unit): Unit = {
  val dir = Utils.createTempDir().getCanonicalFile
  try f(dir) finally {
    // wait for all tasks to finish before deleting files
    waitForTasksToFinish()
    Utils.deleteRecursively(dir)
  }
}

protected def waitForTasksToFinish(): Unit = {
  eventually(timeout(10.seconds)) {
    assert(spark.sparkContext.statusTracker
      .getExecutorInfos.map(_.numRunningTasks()).sum == 0)
  }
}
```

**withCreateTempDir**
```
protected def withCreateTempDir(f: File => Unit): Unit = {
  val dir = Utils.createTempDir()
  try f(dir) finally {
    Utils.deleteRecursively(dir)
  }
}
```

Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22514 **[Test build #99415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99415/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r237333561 --- Diff: docs/ml-clustering.md --- @@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details. + +## Power Iteration Clustering (PIC) + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm --- End diff -- OK sounds good. Let's merge this one first just as a matter of process. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23031 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5493/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23031 **[Test build #99422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99422/testReport)** for PR 23031 at commit [`96b2280`](https://github.com/apache/spark/commit/96b228067cd8be3c51a0c36702b493e961bd9702). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23031 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/23100 Thanks all! I will create a follow-up to add OneHotEncoderEstimator as an alias later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/23031 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r237332601 --- Diff: docs/ml-clustering.md --- @@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details. + +## Power Iteration Clustering (PIC) + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm --- End diff -- The doc change will be in both 2.4 and master, but the R related code will be in master only. I think that's why @felixcheung asked me to open a separate PR to merge in the doc change for 2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23158: [SPARK-26186][SPARK-26184][CORE] Last updated time is no...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23158 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99409/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23158: [SPARK-26186][SPARK-26184][CORE] Last updated time is no...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23158 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23158: [SPARK-26186][SPARK-26184][CORE] Last updated time is no...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23158 **[Test build #99409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99409/testReport)** for PR 23158 at commit [`a07aaa4`](https://github.com/apache/spark/commit/a07aaa44987d6c5a8281d320a0804d6354935a58). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir functi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23151 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir functi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23151 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5492/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir functi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23151 **[Test build #99421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99421/testReport)** for PR 23151 at commit [`d7482a3`](https://github.com/apache/spark/commit/d7482a352dc4e6b0110443b20246facc180fda34). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r237330636 --- Diff: docs/ml-clustering.md --- @@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details. + +## Power Iteration Clustering (PIC) + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm --- End diff -- Pardon, I'm catching up -- why just commit this doc to 2.4 and not master? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23100 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23168 @srowen It's not in master yet. The PR is here https://github.com/apache/spark/pull/23072 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23100 I went through the PR again, and it looks right to me. Merged into master. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/23100 Yes, this diff just reflects the renaming of OneHotEncoderEstimator to OneHotEncoder. Besides that, it also changes the related documents and example code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23052 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99407/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23174 **[Test build #99402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99402/testReport)** for PR 23174 at commit [`0e36a4b`](https://github.com/apache/spark/commit/0e36a4bb4a5a1ad9abee7e003b7d5f3588cba126). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23174 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23052 **[Test build #99407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99407/testReport)** for PR 23052 at commit [`083d411`](https://github.com/apache/spark/commit/083d411ec1822986dbb82fbe1896a6c0d846c7d8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23174 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99402/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23052 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23168 Pardon, was this also already added to master? I don't see it but I could be missing something. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23173 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23173 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99405/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23173 **[Test build #99405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99405/testReport)** for PR 23173 at commit [`bfadbf9`](https://github.com/apache/spark/commit/bfadbf9ae5df8e4a1e845c82dddc91d5082f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23172: [SPARK-25957][followup] Build python docker image in sbt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23172 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23172: [SPARK-25957][followup] Build python docker image in sbt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23172 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99406/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23172: [SPARK-25957][followup] Build python docker image in sbt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23172 **[Test build #99406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99406/testReport)** for PR 23172 at commit [`b49cfde`](https://github.com/apache/spark/commit/b49cfde2211ee30c67f8ec37b6329c854f798438). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237319928 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237314690 --- Diff: docs/structured-streaming-programming-guide.md --- @@ -530,6 +530,12 @@ Here are the details of all the sources in Spark. "s3://a/dataset.txt" "s3n://a/b/dataset.txt" "s3a://a/b/c/dataset.txt" +cleanSource: option to clean up completed files after processing. +Available options are "archive", "delete", "no_op". If the option is not provided, the default value is "no_op". +When "archive" is provided, additional option sourceArchiveDir must be provided as well. The value of "sourceArchiveDir" must be outside of source path, to ensure archived files are never included to new source files again. +Spark will move source files respecting its own path. For example, if the path of source file is "/a/b/dataset.txt" and the path of archive directory is "/archived/here", file will be moved to "/archived/here/a/b/dataset.txt" +NOTE: Both archiving (via moving) or deleting completed files would introduce overhead (slow down) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enbling this option would reduce the cost to list source files which is considered as a heavy operation. +NOTE 2: The source path should not be used from multiple queries when enabling this option, because source files will be moved or deleted which behavior may impact the other queries. --- End diff -- NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
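For context, the `cleanSource` and `sourceArchiveDir` options described in this doc diff are ordinary data source options, so they would be set on the stream reader. A minimal sketch assuming the option names and values quoted above; the text format, paths, and app name are only illustrative, and the final merged behavior depends on the outcome of this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clean-source-sketch").getOrCreate()

// Archive each completed file under /archived/here, preserving its original path:
// e.g. /a/b/dataset.txt would be moved to /archived/here/a/b/dataset.txt.
val lines = spark.readStream
  .format("text")
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/archived/here")
  .load("/a/b")
```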
[GitHub] spark issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keyt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22911 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237319903 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) + val curPathUri = curPath.toUri + + val newPath = new Path(baseArchiveDirPath + curPathUri.getPath) + try { +logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}") +if (!fs.exists(newPath.getParent)) { + fs.mkdirs(newPath.getParent) +} + +logDebug(s"Archiving completed file $curPath to $newPath") +fs.rename(curPath, newPath) + } catch { +case NonFatal(e) => + // Log to error but swallow exception to avoid process being stopped + logWarning(s"Fail to move $curPath to $newPath / skip moving file.", e) + } +} + +def remove(entry: FileEntry): Unit = { + val curPath = new Path(entry.path) --- End diff -- `val curPath = new Path(new URI(entry.path))` to make it escape/unescape path properly. `entry.path` was created from `Path.toUri.toString`. Could you also add a unit test to test special paths such as `/a/b/a b%.txt`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
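A rough illustration of the escaping problem behind this suggestion, assuming standard Hadoop `Path` and `java.net.URI` behavior (the exact encoded form shown in the comments is approximate):

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

// FileEntry.path is produced via Path.toUri.toString, so special characters such as
// spaces and '%' end up percent-encoded in the stored string.
val stored = new Path("/a/b/a b%.txt").toUri.toString  // roughly "/a/b/a%20b%25.txt"

// Rebuilding through a URI decodes the stored form back to the real file path...
val decoded = new Path(new URI(stored))

// ...whereas feeding the stored string straight into Path keeps the escapes, so a later
// rename or delete would target a path that does not exist on the file system.
val notDecoded = new Path(stored)
```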
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237319176 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) + val curPathUri = curPath.toUri + + val newPath = new Path(baseArchiveDirPath + curPathUri.getPath) + try { +logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}") +if (!fs.exists(newPath.getParent)) { + fs.mkdirs(newPath.getParent) +} + +logDebug(s"Archiving completed file $curPath to $newPath") +fs.rename(curPath, newPath) + } catch { +case NonFatal(e) => + // Log to error but swallow exception to avoid process being stopped + logWarning(s"Fail to move $curPath to $newPath / skip moving file.", e) + } +} + +def remove(entry: FileEntry): Unit = { + val curPath = new Path(entry.path) + try { +logDebug(s"Removing completed file $curPath") +fs.delete(curPath, false) + } catch { +case NonFatal(e) => + // Log to error but swallow exception to avoid process being stopped + logWarning(s"Fail to remove $curPath / skip removing file.", e) + } +} + +val logOffset = FileStreamSourceOffset(end).logOffset +metadataLog.get(logOffset) match { --- End diff -- you can use `val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2)` to use the underlying cache in FileStreamSourceLog. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
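A hedged sketch of how the suggested ranged lookup could slot into `commit`, reusing the `move`/`remove` helpers and the `cleanSource`/`sourceArchiveDir` options from the diffs in this thread; the dispatch structure shown here is an assumption, not the PR's actual code:

```scala
override def commit(end: Offset): Unit = {
  val logOffset = FileStreamSourceOffset(end).logOffset
  // The ranged get goes through FileStreamSourceLog's cache, unlike get(logOffset),
  // which may have to re-read the metadata log from storage.
  val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2)

  sourceOptions.cleanSource match {
    case CleanSourceMode.ARCHIVE =>
      // sourceArchiveDir is assumed to have been validated as present when archiving is enabled.
      files.foreach(entry => move(entry, sourceOptions.sourceArchiveDir.get))
    case CleanSourceMode.DELETE =>
      files.foreach(remove)
    case CleanSourceMode.NO_OP => // nothing to clean up
  }
}
```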
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237200636 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala --- @@ -74,6 +76,39 @@ class FileStreamOptions(parameters: CaseInsensitiveMap[String]) extends Logging */ val fileNameOnly: Boolean = withBooleanParameter("fileNameOnly", false) + /** + * The archive directory to move completed files. The option will be only effective when + * "cleanSource" is set to "archive". + * + * Note that the completed file will be moved to this archive directory with respecting to + * its own path. + * + * For example, if the path of source file is "/a/b/dataset.txt", and the path of archive + * directory is "/archived/here", file will be moved to "/archived/here/a/b/dataset.txt". + */ + val sourceArchiveDir: Option[String] = parameters.get("sourceArchiveDir") + + /** + * Defines how to clean up completed files. Available options are "archive", "delete", "no_op". + */ + val cleanSource: CleanSourceMode.Value = { +val modeStrOption = parameters.getOrElse("cleanSource", CleanSourceMode.NO_OP.toString) --- End diff -- nit: could you add a method to `CleanSourceMode` that converts a string to `CleanSourceMode.Value`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
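One possible shape for the helper this nit asks for; the `fromString` name and the error handling are illustrative, not the PR's actual implementation:

```scala
object CleanSourceMode extends Enumeration {
  val ARCHIVE, DELETE, NO_OP = Value

  // Map the user-supplied "cleanSource" option onto the enum, defaulting to NO_OP when absent.
  def fromString(mode: Option[String]): Value = mode.map(_.toLowerCase) match {
    case Some("archive") => ARCHIVE
    case Some("delete") => DELETE
    case Some("no_op") | None => NO_OP
    case Some(other) => throw new IllegalArgumentException(
      s"Invalid value for 'cleanSource': $other. Must be one of archive, delete, no_op.")
  }
}
```

With a helper like this, the quoted block in `FileStreamOptions` could collapse to something like `val cleanSource: CleanSourceMode.Value = CleanSourceMode.fromString(parameters.get("cleanSource"))`.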
[GitHub] spark issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keyt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22911 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99404/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237315173 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -100,6 +101,36 @@ class FileStreamSource( logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = $maxFileAgeMs") + ensureNoOverlapBetweenSourceAndArchivePath() --- End diff -- Could you do this check only when CleanSourceMode is `ARCHIVE`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
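A small sketch of the guard being requested, assuming `cleanSource` is exposed through `sourceOptions` as in the FileStreamOptions diff earlier in this thread:

```scala
// Only validate the archive directory when completed files will actually be archived.
if (sourceOptions.cleanSource == CleanSourceMode.ARCHIVE) {
  ensureNoOverlapBetweenSourceAndArchivePath()
}
```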
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237315718 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -257,16 +289,65 @@ class FileStreamSource( * equal to `end` and will only request offsets greater than `end` in the future. */ override def commit(end: Offset): Unit = { -// No-op for now; FileStreamSource currently garbage-collects files based on timestamp -// and the value of the maxFileAge parameter. +def move(entry: FileEntry, baseArchiveDirPath: String): Unit = { + val curPath = new Path(entry.path) + val curPathUri = curPath.toUri + + val newPath = new Path(baseArchiveDirPath + curPathUri.getPath) + try { +logDebug(s"Creating directory if it doesn't exist ${newPath.getParent}") +if (!fs.exists(newPath.getParent)) { + fs.mkdirs(newPath.getParent) +} + +logDebug(s"Archiving completed file $curPath to $newPath") +fs.rename(curPath, newPath) --- End diff -- It's better to also check the return value of `rename`. A user may reuse a source archive dir and cause path conflicts. We should also log this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
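A hedged sketch of the requested check; Hadoop's `FileSystem.rename` reports some failures (for example, the destination already existing in a reused archive directory) through its return value rather than an exception:

```scala
logDebug(s"Archiving completed file $curPath to $newPath")
if (!fs.rename(curPath, newPath)) {
  // rename returned false instead of throwing, e.g. because the target already exists,
  // so surface the failure in the logs rather than silently skipping the file.
  logWarning(s"Fail to archive $curPath to $newPath; the target path may already exist.")
}
```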
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237320515 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -100,6 +101,36 @@ class FileStreamSource( logInfo(s"maxFilesPerBatch = $maxFilesPerBatch, maxFileAgeMs = $maxFileAgeMs") + ensureNoOverlapBetweenSourceAndArchivePath() + + private def ensureNoOverlapBetweenSourceAndArchivePath(): Unit = { +@tailrec +def removeGlob(path: Path): Path = { + if (path.getName.contains("*")) { +removeGlob(path.getParent) + } else { +path + } +} + +sourceOptions.sourceArchiveDir match { + case None => + case Some(archiveDir) => +val sourceUri = removeGlob(qualifiedBasePath).toUri +val archiveUri = new Path(archiveDir).toUri + +val sourcePath = sourceUri.getPath +val archivePath = archiveUri.getPath --- End diff -- we need to use `fs.makeQualified` to turn all user-provided paths into absolute paths, as the user may pass just a relative path. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
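A sketch of the qualification step, using the names from the diff above (`fs` is the source's Hadoop `FileSystem` handle, and `qualifiedBasePath` is assumed to be qualified already):

```scala
// Qualify the user-provided archive dir so a relative path becomes absolute
// before it is compared against the source path for overlap.
val sourceUri = removeGlob(qualifiedBasePath).toUri
val archiveUri = fs.makeQualified(new Path(archiveDir)).toUri

val sourcePath = sourceUri.getPath
val archivePath = archiveUri.getPath
```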
[GitHub] spark pull request #22952: [SPARK-20568][SS] Provide option to clean up comp...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22952#discussion_r237314459 --- Diff: docs/structured-streaming-programming-guide.md --- @@ -530,6 +530,12 @@ Here are the details of all the sources in Spark. "s3://a/dataset.txt" "s3n://a/b/dataset.txt" "s3a://a/b/c/dataset.txt" +cleanSource: option to clean up completed files after processing. +Available options are "archive", "delete", "no_op". If the option is not provided, the default value is "no_op". +When "archive" is provided, additional option sourceArchiveDir must be provided as well. The value of "sourceArchiveDir" must be outside of source path, to ensure archived files are never included to new source files again. +Spark will move source files respecting its own path. For example, if the path of source file is "/a/b/dataset.txt" and the path of archive directory is "/archived/here", file will be moved to "/archived/here/a/b/dataset.txt" +NOTE: Both archiving (via moving) or deleting completed files would introduce overhead (slow down) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enbling this option would reduce the cost to list source files which is considered as a heavy operation. +NOTE 2: The source path should not be used from multiple queries when enabling this option, because source files will be moved or deleted which behavior may impact the other queries. --- End diff -- NOTE 2: The source path should not be used from multiple **sources or** queries when enabling this option, because source files will be moved or deleted which behavior may impact the other **sources and** queries. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keyt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22911 **[Test build #99404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99404/testReport)** for PR 22911 at commit [`88f1bb5`](https://github.com/apache/spark/commit/88f1bb516fa6098125fb7f77fd548a1d423cf745). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org