[GitHub] AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446490616 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6003/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446490616 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6003/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446490610 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446490610 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
SparkQA commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446490498 **[Test build #19 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19/testReport)** for PR 23266 at commit [`d9ae0fe`](https://github.com/apache/spark/commit/d9ae0fe1c61d4b21205152a2b6ea31ef23baf3b8). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] sadhen commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
sadhen commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446490065 OK, I will polish it later. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446488548 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/17/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446488546 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446488546 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
SparkQA removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446475933 **[Test build #17 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17/testReport)** for PR 23144 at commit [`df5559e`](https://github.com/apache/spark/commit/df5559e1b5248844c98be487f3f051d9df6808b7). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446488548 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/17/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
SparkQA commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446488365 **[Test build #17 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17/testReport)** for PR 23144 at commit [`df5559e`](https://github.com/apache/spark/commit/df5559e1b5248844c98be487f3f051d9df6808b7). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446486859 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/11/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446486855 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446486855 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446486859 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/11/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
SparkQA removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446442398 **[Test build #11 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/11/testReport)** for PR 19045 at commit [`9d4fc23`](https://github.com/apache/spark/commit/9d4fc238d2d4e0984d4880d42c4cf6a1f1a52f8b). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446486573 **[Test build #11 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/11/testReport)** for PR 19045 at commit [`9d4fc23`](https://github.com/apache/spark/commit/9d4fc238d2d4e0984d4880d42c4cf6a1f1a52f8b). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] kiszk commented on issue #23294: [SPARK-26265][Core][Followup] Put freePage into a finally block
kiszk commented on issue #23294: [SPARK-26265][Core][Followup] Put freePage into a finally block URL: https://github.com/apache/spark/pull/23294#issuecomment-446485354 LGTM, thanks This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] felixcheung commented on issue #23263: [SPARK-23674][ML] Adds Spark ML Events
felixcheung commented on issue #23263: [SPARK-23674][ML] Adds Spark ML Events URL: https://github.com/apache/spark/pull/23263#issuecomment-446483977 my 2c - it does seem useful, though sounds like mostly for Atlas for now. if you can think of a less intrusive way to do this it might be easier to review... This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] felixcheung commented on issue #23263: [SPARK-23674][ML] Adds Spark ML Events
felixcheung commented on issue #23263: [SPARK-23674][ML] Adds Spark ML Events URL: https://github.com/apache/spark/pull/23263#issuecomment-446484055 (oops sorry - too easy to close) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HyukjinKwon opened a new pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
HyukjinKwon opened a new pull request #23263: [SPARK-23674][ML] Adds Spark ML Events URL: https://github.com/apache/spark/pull/23263 ## What changes were proposed in this pull request? This PR proposes to add ML events so that other developers can track and add some actions for them. ## Introduction ML events (like SQL events) can be quite useful when people want to track and make some actions for corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes with this PR, I can visualise ML pipeline as below: ![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png) Another good thing that might have to be considered is, that we can interact this with other SQL/Streaming events. For instance, where the input `Dataset` is originated. For instance, with current Apache Spark, I can visualise SQL operations as below: ![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png) I think we can combine those existing lineages together to easily understand where the data comes and goes. Currently, ML side is a hole so the lineages can't be connected for the current Apache Spark .. To add up, I think it's not to mention how useful it is to track the SQL/Streaming operations. Likewise, I would like to propose ML events as well (as lowest stability `@Unstable` APIs for now - no guarantee about stability). ## Implementation Details ### Sends event (but not expose ML specific listener) **`mllib/src/main/scala/org/apache/spark/ml/events.scala`** ```scala @Unstable case class ...StartEvent(caller, input) @Unstable case class ...EndEvent(caller, output) object MLEvents { // Wrappers to send events: // def with...Event(body) = { // body() // SparkContext.getOrCreate().listenerBus.post(event) // } } ``` This way mimics both: **1. Catalog events (see `org/apache/spark/sql/catalyst/catalog/events.scala`)** - This allows a Catalog specific listener to be added `ExternalCatalogEventListener` - It's implemented in a way of wrapping whole `ExternalCatalog` named `ExternalCatalogWithListener` which delegates the operations to `ExternalCatalog` This is not quite possible in this case because most of instances (like `Pipeline`) will be directly created in most of cases. We might be able to do that via extending `ListenerBus` for all possible instances but IMHO it's too invasive. Also, exposing another ML specific listener sounds a bit too much at this stage. Therefore, I simply borrowed file name and structures here **2. SQL execution events (see `org/apache/spark/sql/execution/SQLExecution.scala`)** - Add an object that wraps a body to send events Current apporach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent. ### Add `...Impl` methods to wrap each to send events **`mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`** ```diff - def save(...) = { saveImpl(...) } + def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) } def saveImpl(...): Unit = ... ``` Note that `saveImpl` was already implemented unlike other instances below. ```diff - def load(...): T + def load(...): T = MLEvents.withLoadInstanceEvent { loadImple(...) } + def loadImpl(...): T ``` **`mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`** ```diff - def fit(...): Model + def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) } + def fitImpl(...): Model ``` **`mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`** ```diff - def transform(...): DataFrame + def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl(...) } + def transformImpl(...): DataFrame ``` This approach follows the existing way as below in ML: **1. `transform` and `transformImpl`** https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213 https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196 https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042 **2. `save` and `saveImpl`** https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadW
[GitHub] felixcheung closed pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
felixcheung closed pull request #23263: [SPARK-23674][ML] Adds Spark ML Events URL: https://github.com/apache/spark/pull/23263 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala b/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala index 1247882d6c1bd..a3c4db06862f8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/Estimator.scala @@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage { * Fits a model to the input data. */ @Since("2.0.0") - def fit(dataset: Dataset[_]): M + def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) { +fitImpl(dataset) + } + + /** + * `fit()` handles events and then calls this method. Subclasses should override this + * method to implement the actual fiting a model to the input data. + */ + @Since("3.0.0") + protected def fitImpl(dataset: Dataset[_]): M = { +// Keep this default body for backward compatibility. +throw new UnsupportedOperationException("fitImpl is not implemented.") + } /** * Fits multiple models to the input data with multiple sets of parameters. diff --git a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala index 103082b7b9766..1c781faff129e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala @@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") ( * @return fitted pipeline */ @Since("2.0.0") - override def fit(dataset: Dataset[_]): PipelineModel = { + override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset) + override protected def fitImpl(dataset: Dataset[_]): PipelineModel = { transformSchema(dataset.schema, logging = true) val theStages = $(stages) // Search for the last estimator. @@ -210,7 +211,7 @@ object Pipeline extends MLReadable[Pipeline] { /** Checked against metadata when loading model */ private val className = classOf[Pipeline].getName -override def load(path: String): Pipeline = { +override protected def loadImpl(path: String): Pipeline = { val (uid: String, stages: Array[PipelineStage]) = SharedReadWrite.load(className, sc, path) new Pipeline(uid).setStages(stages) } @@ -301,7 +302,8 @@ class PipelineModel private[ml] ( } @Since("2.0.0") - override def transform(dataset: Dataset[_]): DataFrame = { + override def transform(dataset: Dataset[_]): DataFrame = super.transform(dataset) + override protected def transformImpl(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) stages.foldLeft(dataset.toDF)((cur, transformer) => transformer.transform(cur)) } @@ -344,7 +346,7 @@ object PipelineModel extends MLReadable[PipelineModel] { /** Checked against metadata when loading model */ private val className = classOf[PipelineModel].getName -override def load(path: String): PipelineModel = { +override protected def loadImpl(path: String): PipelineModel = { val (uid: String, stages: Array[PipelineStage]) = SharedReadWrite.load(className, sc, path) val transformers = stages map { case stage: Transformer => stage diff --git a/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala b/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala index d8f3dfa874439..3731ddae0160c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala @@ -94,7 +94,10 @@ abstract class Predictor[ /** @group setParam */ def setPredictionCol(value: String): Learner = set(predictionCol, value).asInstanceOf[Learner] - override def fit(dataset: Dataset[_]): M = { + // Explictly call parent's load. Otherwise, MiMa complains. + override def fit(dataset: Dataset[_]): M = super.fit(dataset) + + override protected def fitImpl(dataset: Dataset[_]): M = { // This handles a few items such as schema validation. // Developers only need to implement train(). transformSchema(dataset.schema, logging = true) @@ -199,7 +202,8 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, * @param dataset input dataset * @return transformed dataset with [[predictionCol]] of type `Double` */ - override def transform(dataset: Dataset[_]): DataFrame = { + override def transform( + dataset: Dataset[_]): DataFrame = MLEvents.withTransformEvent(this, dataset) { transformSchema(dataset.schema, logging = true) if ($(predictionCol).nonEmpty) { transformImpl(dataset) @@ -210,7 +21
[GitHub] felixcheung commented on a change in pull request #23292: [SPARK-19827][R][FOLLOWUP] spark.ml R API for PIC
felixcheung commented on a change in pull request #23292: [SPARK-19827][R][FOLLOWUP] spark.ml R API for PIC URL: https://github.com/apache/spark/pull/23292#discussion_r240898609 ## File path: R/pkg/R/mllib_fpm.R ## @@ -183,8 +183,8 @@ setMethod("write.ml", signature(object = "FPGrowthModel", path = "character"), #' @return A complete set of frequent sequential patterns in the input sequences of itemsets. #' The returned \code{SparkDataFrame} contains columns of sequence and corresponding #' frequency. The schema of it will be: -#' \code{sequence: ArrayType(ArrayType(T))} (T is the item type) -#' \code{freq: Long} +#' \code{sequence: ArrayType(ArrayType(T))}, \code{freq: integer} +#' where T is the item type Review comment: let's fix both This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446481380 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/18/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
SparkQA commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446481366 **[Test build #18 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18/testReport)** for PR 23266 at commit [`4a982b6`](https://github.com/apache/spark/commit/4a982b66fb5cdb01a238fce79dd7f3e4d08076d6). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446481376 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446481376 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
SparkQA removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446480621 **[Test build #18 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18/testReport)** for PR 23266 at commit [`4a982b6`](https://github.com/apache/spark/commit/4a982b66fb5cdb01a238fce79dd7f3e4d08076d6). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446481380 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/18/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446480632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6002/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446480632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6002/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446480626 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
cloud-fan commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446480647 It's a little weird to have table without schema. I leave `schema` in `Table` with comment saying that, implementation can return empty schema if the table is not readable. For a sink that can accept data in any schema, data source API might not be a good option. `Dataset.foreach` could be better. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
SparkQA commented on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446480621 **[Test build #18 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18/testReport)** for PR 23266 at commit [`4a982b6`](https://github.com/apache/spark/commit/4a982b66fb5cdb01a238fce79dd7f3e4d08076d6). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits
AmplabJenkins removed a comment on issue #23266: [SPARK-26313][SQL] move read related methods from Table to read related mix-in traits URL: https://github.com/apache/spark/pull/23266#issuecomment-446480626 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446477915 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/10/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240894917 ## File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala ## @@ -1494,6 +1512,287 @@ class FileStreamSourceSuite extends FileStreamSourceTest { newSource.getBatch(None, FileStreamSourceOffset(1)) } } + + test("remove completed files when remove option is enabled") { +def assertFileIsRemoved(files: Array[String], fileName: String): Unit = { + assert(!files.exists(_.startsWith(fileName))) +} + +def assertFileIsNotRemoved(files: Array[String], fileName: String): Unit = { + assert(files.exists(_.startsWith(fileName))) +} + +withTempDirs { case (src, tmp) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "delete") + +val fileStream = createFileStream("text", src.getCanonicalPath, options = option) +val filtered = fileStream.filter($"value" contains "keep") + +testStream(filtered)( + AddTextFileData("keep1", src, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file removed") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotRemoved(src.list(), "keep1") +true + }, + AddTextFileData("keep2", src, tmp, tmpFilePrefix = "ke ep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file removed") { _: StreamExecution => +val files = src.list() + +// it renames input file for first batch, but not for second batch yet +assertFileIsRemoved(files, "keep1") +assertFileIsNotRemoved(files, "ke ep2 %") + +true + }, + AddTextFileData("keep3", src, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file renamed") { _: StreamExecution => +val files = src.list() + +// it renames input file for second batch, but not third batch yet +assertFileIsRemoved(files, "ke ep2 %") +assertFileIsNotRemoved(files, "keep3") + +true + } +) + } +} + } + + test("move completed files to archive directory when archive option is enabled") { + +withThreeTempDirs { case (src, tmp, archiveDir) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "archive", "sourceArchiveDir" -> archiveDir.getAbsolutePath) + +val fileStream = createFileStream("text", s"${src.getCanonicalPath}/*/*", + options = option) +val filtered = fileStream.filter($"value" contains "keep") + +// src/k %1 +// file: src/k1 %1/keep1 +val dirForKeep1 = new File(src, "k %1") +// src/k1/k 2 +// file: src/k1/k 2/keep2 +val dirForKeep2 = new File(dirForKeep1, "k 2") +// src/k3 +// file: src/k3/keep3 +val dirForKeep3 = new File(src, "k3") + +val expectedMovedDir1 = new File(archiveDir.getAbsolutePath + dirForKeep1.toURI.getPath) +val expectedMovedDir2 = new File(archiveDir.getAbsolutePath + dirForKeep2.toURI.getPath) +val expectedMovedDir3 = new File(archiveDir.getAbsolutePath + dirForKeep3.toURI.getPath) + +testStream(filtered)( + AddTextFileData("keep1", dirForKeep1, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotMoved(dirForKeep1, expectedMovedDir1, "keep1") +true + }, + AddTextFileData("keep2", dirForKeep2, tmp, tmpFilePrefix = "keep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renames input file for first batch, but not for second batch yet +assertFileIsMoved(dirForKeep1, expectedMovedDir1, "keep1") +assertFileIsNotMoved(dirForKeep2, expectedMovedDir2, "keep2 %") +true + }, + AddTextFileData("keep3", dirForKeep3, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renam
[GitHub] AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446477908 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240894769 ## File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala ## @@ -1494,6 +1512,287 @@ class FileStreamSourceSuite extends FileStreamSourceTest { newSource.getBatch(None, FileStreamSourceOffset(1)) } } + + test("remove completed files when remove option is enabled") { +def assertFileIsRemoved(files: Array[String], fileName: String): Unit = { + assert(!files.exists(_.startsWith(fileName))) +} + +def assertFileIsNotRemoved(files: Array[String], fileName: String): Unit = { + assert(files.exists(_.startsWith(fileName))) +} + +withTempDirs { case (src, tmp) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "delete") + +val fileStream = createFileStream("text", src.getCanonicalPath, options = option) +val filtered = fileStream.filter($"value" contains "keep") + +testStream(filtered)( + AddTextFileData("keep1", src, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file removed") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotRemoved(src.list(), "keep1") +true + }, + AddTextFileData("keep2", src, tmp, tmpFilePrefix = "ke ep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file removed") { _: StreamExecution => +val files = src.list() + +// it renames input file for first batch, but not for second batch yet +assertFileIsRemoved(files, "keep1") +assertFileIsNotRemoved(files, "ke ep2 %") + +true + }, + AddTextFileData("keep3", src, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file renamed") { _: StreamExecution => +val files = src.list() + +// it renames input file for second batch, but not third batch yet +assertFileIsRemoved(files, "ke ep2 %") +assertFileIsNotRemoved(files, "keep3") + +true + } +) + } +} + } + + test("move completed files to archive directory when archive option is enabled") { + +withThreeTempDirs { case (src, tmp, archiveDir) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "archive", "sourceArchiveDir" -> archiveDir.getAbsolutePath) + +val fileStream = createFileStream("text", s"${src.getCanonicalPath}/*/*", + options = option) +val filtered = fileStream.filter($"value" contains "keep") + +// src/k %1 +// file: src/k1 %1/keep1 +val dirForKeep1 = new File(src, "k %1") +// src/k1/k 2 +// file: src/k1/k 2/keep2 +val dirForKeep2 = new File(dirForKeep1, "k 2") +// src/k3 +// file: src/k3/keep3 +val dirForKeep3 = new File(src, "k3") + +val expectedMovedDir1 = new File(archiveDir.getAbsolutePath + dirForKeep1.toURI.getPath) +val expectedMovedDir2 = new File(archiveDir.getAbsolutePath + dirForKeep2.toURI.getPath) +val expectedMovedDir3 = new File(archiveDir.getAbsolutePath + dirForKeep3.toURI.getPath) + +testStream(filtered)( + AddTextFileData("keep1", dirForKeep1, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotMoved(dirForKeep1, expectedMovedDir1, "keep1") +true + }, + AddTextFileData("keep2", dirForKeep2, tmp, tmpFilePrefix = "keep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renames input file for first batch, but not for second batch yet +assertFileIsMoved(dirForKeep1, expectedMovedDir1, "keep1") +assertFileIsNotMoved(dirForKeep2, expectedMovedDir2, "keep2 %") +true + }, + AddTextFileData("keep3", dirForKeep3, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renam
[GitHub] AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446477915 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/10/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446477908 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240894769 ## File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala ## @@ -1494,6 +1512,287 @@ class FileStreamSourceSuite extends FileStreamSourceTest { newSource.getBatch(None, FileStreamSourceOffset(1)) } } + + test("remove completed files when remove option is enabled") { +def assertFileIsRemoved(files: Array[String], fileName: String): Unit = { + assert(!files.exists(_.startsWith(fileName))) +} + +def assertFileIsNotRemoved(files: Array[String], fileName: String): Unit = { + assert(files.exists(_.startsWith(fileName))) +} + +withTempDirs { case (src, tmp) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "delete") + +val fileStream = createFileStream("text", src.getCanonicalPath, options = option) +val filtered = fileStream.filter($"value" contains "keep") + +testStream(filtered)( + AddTextFileData("keep1", src, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file removed") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotRemoved(src.list(), "keep1") +true + }, + AddTextFileData("keep2", src, tmp, tmpFilePrefix = "ke ep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file removed") { _: StreamExecution => +val files = src.list() + +// it renames input file for first batch, but not for second batch yet +assertFileIsRemoved(files, "keep1") +assertFileIsNotRemoved(files, "ke ep2 %") + +true + }, + AddTextFileData("keep3", src, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file renamed") { _: StreamExecution => +val files = src.list() + +// it renames input file for second batch, but not third batch yet +assertFileIsRemoved(files, "ke ep2 %") +assertFileIsNotRemoved(files, "keep3") + +true + } +) + } +} + } + + test("move completed files to archive directory when archive option is enabled") { + +withThreeTempDirs { case (src, tmp, archiveDir) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "archive", "sourceArchiveDir" -> archiveDir.getAbsolutePath) + +val fileStream = createFileStream("text", s"${src.getCanonicalPath}/*/*", + options = option) +val filtered = fileStream.filter($"value" contains "keep") + +// src/k %1 +// file: src/k1 %1/keep1 +val dirForKeep1 = new File(src, "k %1") +// src/k1/k 2 +// file: src/k1/k 2/keep2 +val dirForKeep2 = new File(dirForKeep1, "k 2") +// src/k3 +// file: src/k3/keep3 +val dirForKeep3 = new File(src, "k3") + +val expectedMovedDir1 = new File(archiveDir.getAbsolutePath + dirForKeep1.toURI.getPath) +val expectedMovedDir2 = new File(archiveDir.getAbsolutePath + dirForKeep2.toURI.getPath) +val expectedMovedDir3 = new File(archiveDir.getAbsolutePath + dirForKeep3.toURI.getPath) + +testStream(filtered)( + AddTextFileData("keep1", dirForKeep1, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotMoved(dirForKeep1, expectedMovedDir1, "keep1") +true + }, + AddTextFileData("keep2", dirForKeep2, tmp, tmpFilePrefix = "keep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renames input file for first batch, but not for second batch yet +assertFileIsMoved(dirForKeep1, expectedMovedDir1, "keep1") +assertFileIsNotMoved(dirForKeep2, expectedMovedDir2, "keep2 %") +true + }, + AddTextFileData("keep3", dirForKeep3, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renam
[GitHub] SparkQA removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
SparkQA removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446435549 **[Test build #10 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/10/testReport)** for PR 19045 at commit [`abb0609`](https://github.com/apache/spark/commit/abb060949797cec31ec02080c5720ccec9df3f5c). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446477611 **[Test build #10 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/10/testReport)** for PR 19045 at commit [`abb0609`](https://github.com/apache/spark/commit/abb060949797cec31ec02080c5720ccec9df3f5c). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240894533 ## File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala ## @@ -1494,6 +1512,287 @@ class FileStreamSourceSuite extends FileStreamSourceTest { newSource.getBatch(None, FileStreamSourceOffset(1)) } } + + test("remove completed files when remove option is enabled") { +def assertFileIsRemoved(files: Array[String], fileName: String): Unit = { + assert(!files.exists(_.startsWith(fileName))) +} + +def assertFileIsNotRemoved(files: Array[String], fileName: String): Unit = { + assert(files.exists(_.startsWith(fileName))) +} + +withTempDirs { case (src, tmp) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "delete") + +val fileStream = createFileStream("text", src.getCanonicalPath, options = option) +val filtered = fileStream.filter($"value" contains "keep") + +testStream(filtered)( + AddTextFileData("keep1", src, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file removed") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotRemoved(src.list(), "keep1") +true + }, + AddTextFileData("keep2", src, tmp, tmpFilePrefix = "ke ep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file removed") { _: StreamExecution => +val files = src.list() + +// it renames input file for first batch, but not for second batch yet +assertFileIsRemoved(files, "keep1") +assertFileIsNotRemoved(files, "ke ep2 %") + +true + }, + AddTextFileData("keep3", src, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file renamed") { _: StreamExecution => +val files = src.list() + +// it renames input file for second batch, but not third batch yet +assertFileIsRemoved(files, "ke ep2 %") +assertFileIsNotRemoved(files, "keep3") + +true + } +) + } +} + } + + test("move completed files to archive directory when archive option is enabled") { + +withThreeTempDirs { case (src, tmp, archiveDir) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "archive", "sourceArchiveDir" -> archiveDir.getAbsolutePath) + +val fileStream = createFileStream("text", s"${src.getCanonicalPath}/*/*", + options = option) +val filtered = fileStream.filter($"value" contains "keep") + +// src/k %1 +// file: src/k1 %1/keep1 +val dirForKeep1 = new File(src, "k %1") +// src/k1/k 2 +// file: src/k1/k 2/keep2 +val dirForKeep2 = new File(dirForKeep1, "k 2") +// src/k3 +// file: src/k3/keep3 +val dirForKeep3 = new File(src, "k3") + +val expectedMovedDir1 = new File(archiveDir.getAbsolutePath + dirForKeep1.toURI.getPath) +val expectedMovedDir2 = new File(archiveDir.getAbsolutePath + dirForKeep2.toURI.getPath) +val expectedMovedDir3 = new File(archiveDir.getAbsolutePath + dirForKeep3.toURI.getPath) + +testStream(filtered)( + AddTextFileData("keep1", dirForKeep1, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotMoved(dirForKeep1, expectedMovedDir1, "keep1") +true + }, + AddTextFileData("keep2", dirForKeep2, tmp, tmpFilePrefix = "keep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renames input file for first batch, but not for second batch yet +assertFileIsMoved(dirForKeep1, expectedMovedDir1, "keep1") +assertFileIsNotMoved(dirForKeep2, expectedMovedDir2, "keep2 %") +true + }, + AddTextFileData("keep3", dirForKeep3, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file archived") { _: StreamExecution => +// it renam
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240894261 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ## @@ -329,4 +345,124 @@ object FileStreamSource { def size: Int = map.size() } + + class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path, +baseArchivePathString: Option[String]) extends Logging { + +private val sourceGlobFilters: Seq[GlobFilter] = buildSourceGlobFilters(sourcePath) + +private val baseArchivePath: Option[Path] = baseArchivePathString.map(new Path(_)) + +def archive(entry: FileEntry): Unit = { + require(baseArchivePath.isDefined) + + val curPath = new Path(new URI(entry.path)) + val curPathUri = curPath.toUri + + val newPath = buildArchiveFilePath(curPathUri) + + if (baseArchivePath.get.depth() <= 2) { Review comment: It's just a shortcut of condition: given that source pattern will read files which exactly match to the pattern, or their parents match the pattern. Once we are adding base archive path as prefix of final path, if the depth of base archive path is greater than 2, FileStreamSource will not read the archived file as source. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] gatorsmile commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
gatorsmile commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446477152 > The semicolon (;) terminates an SQL command. It cannot appear anywhere within a command, except within a string constant or quoted identifier. Thus, you need to consider both double quotes, single quotes. Also backstick, which is being used by Spark SQL for quoting identifiers. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446476895 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6001/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446476895 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6001/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins removed a comment on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446476891 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
AmplabJenkins commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446476891 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] gatorsmile commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
gatorsmile commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446476683 Let us first update the PR description to explain the problem we want to resolve. Regarding comments, our SQL parser follows PostgreSQL. You can read SqlBase.g4 to confirm it. ``` SIMPLE_COMMENT : '--' ~[\r\n]* '\r'? '\n'? -> channel(HIDDEN) ; BRACKETED_EMPTY_COMMENT : '/**/' -> channel(HIDDEN) ; BRACKETED_COMMENT : '/*' ~[+] .*? '*/' -> channel(HIDDEN) ; ``` In PostgreSQL doc: https://www.postgresql.org/docs/9.4/sql-syntax-lexical.html > A comment is a sequence of characters beginning with double dashes and extending to the end of the line, e.g.: ``` -- This is a standard SQL comment ``` > Alternatively, C-style block comments can be used: ``` /* multiline comment * with nesting: /* nested block comment */ */ ``` > where the comment begins with /* and extends to the matching occurrence of */. These block comments nest, as specified in the SQL standard but unlike C, so that one can comment out larger blocks of code that might contain existing block comments. > A comment is removed from the input stream before further syntax analysis and is effectively replaced by whitespace. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240893378 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ## @@ -329,4 +345,124 @@ object FileStreamSource { def size: Int = map.size() } + + class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path, +baseArchivePathString: Option[String]) extends Logging { + +private val sourceGlobFilters: Seq[GlobFilter] = buildSourceGlobFilters(sourcePath) + +private val baseArchivePath: Option[Path] = baseArchivePathString.map(new Path(_)) + +def archive(entry: FileEntry): Unit = { + require(baseArchivePath.isDefined) + + val curPath = new Path(new URI(entry.path)) + val curPathUri = curPath.toUri + + val newPath = buildArchiveFilePath(curPathUri) + + if (baseArchivePath.get.depth() <= 2) { +if (isArchiveFileMatchedAgainstSourcePattern(newPath)) { + logWarning(s"Fail to move $curPath to $newPath - destination matches " + +s"to source path/pattern. Skip moving file.") +} else { + doArchive(curPath, newPath) +} + } else { +// there's no chance for archive file to be matched against source pattern +doArchive(curPath, newPath) + } +} + +def remove(entry: FileEntry): Unit = { + val curPath = new Path(new URI(entry.path)) + try { +logDebug(s"Removing completed file $curPath") +fileSystem.delete(curPath, false) Review comment: Indeed. Thanks for pointing it out! Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240893237 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ## @@ -329,4 +345,124 @@ object FileStreamSource { def size: Int = map.size() } + + class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path, +baseArchivePathString: Option[String]) extends Logging { Review comment: Same here: I think it is allowed at most two lines, but no strong opinions regarding this. Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240893299 ## File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala ## @@ -1494,6 +1512,287 @@ class FileStreamSourceSuite extends FileStreamSourceTest { newSource.getBatch(None, FileStreamSourceOffset(1)) } } + + test("remove completed files when remove option is enabled") { +def assertFileIsRemoved(files: Array[String], fileName: String): Unit = { + assert(!files.exists(_.startsWith(fileName))) +} + +def assertFileIsNotRemoved(files: Array[String], fileName: String): Unit = { + assert(files.exists(_.startsWith(fileName))) +} + +withTempDirs { case (src, tmp) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "delete") + +val fileStream = createFileStream("text", src.getCanonicalPath, options = option) +val filtered = fileStream.filter($"value" contains "keep") + +testStream(filtered)( + AddTextFileData("keep1", src, tmp, tmpFilePrefix = "keep1"), + CheckAnswer("keep1"), + AssertOnQuery("input file removed") { _: StreamExecution => +// it doesn't rename any file yet +assertFileIsNotRemoved(src.list(), "keep1") +true + }, + AddTextFileData("keep2", src, tmp, tmpFilePrefix = "ke ep2 %"), + CheckAnswer("keep1", "keep2"), + AssertOnQuery("input file removed") { _: StreamExecution => +val files = src.list() + +// it renames input file for first batch, but not for second batch yet +assertFileIsRemoved(files, "keep1") +assertFileIsNotRemoved(files, "ke ep2 %") + +true + }, + AddTextFileData("keep3", src, tmp, tmpFilePrefix = "keep3"), + CheckAnswer("keep1", "keep2", "keep3"), + AssertOnQuery("input file renamed") { _: StreamExecution => +val files = src.list() + +// it renames input file for second batch, but not third batch yet +assertFileIsRemoved(files, "ke ep2 %") +assertFileIsNotRemoved(files, "keep3") + +true + } +) + } +} + } + + test("move completed files to archive directory when archive option is enabled") { + +withThreeTempDirs { case (src, tmp, archiveDir) => + withSQLConf( +SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2", +// Force deleting the old logs +SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1" + ) { +val option = Map("latestFirst" -> "false", "maxFilesPerTrigger" -> "1", + "cleanSource" -> "archive", "sourceArchiveDir" -> archiveDir.getAbsolutePath) + +val fileStream = createFileStream("text", s"${src.getCanonicalPath}/*/*", + options = option) +val filtered = fileStream.filter($"value" contains "keep") + +// src/k %1 +// file: src/k1 %1/keep1 +val dirForKeep1 = new File(src, "k %1") +// src/k1/k 2 +// file: src/k1/k 2/keep2 Review comment: Nice catch! Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240893237 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ## @@ -329,4 +345,124 @@ object FileStreamSource { def size: Int = map.size() } + + class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path, +baseArchivePathString: Option[String]) extends Logging { Review comment: Same here: I think it is accepted at most two lines, but no strong opinions regarding this. Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
SparkQA commented on issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML URL: https://github.com/apache/spark/pull/23144#issuecomment-446475933 **[Test build #17 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17/testReport)** for PR 23144 at commit [`df5559e`](https://github.com/apache/spark/commit/df5559e1b5248844c98be487f3f051d9df6808b7). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
HeartSaVioR commented on a change in pull request #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query URL: https://github.com/apache/spark/pull/22952#discussion_r240893073 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ## @@ -329,4 +345,124 @@ object FileStreamSource { def size: Int = map.size() } + + class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path, Review comment: No need to be public. Will address. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR edited a comment on issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for query plans
HeartSaVioR edited a comment on issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for query plans URL: https://github.com/apache/spark/pull/23169#issuecomment-446475026 @DaveDeCaprio I'm not sure only adding doc in SQLConf is enough for end users to be aware when troubleshooting their issues, but let's wait on committers feedback. cc. to @vanzin I guess you might be interested on reviewing this, given that you've left comments in this issue and relevant issues for Apache JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] HeartSaVioR commented on issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for query plans
HeartSaVioR commented on issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for query plans URL: https://github.com/apache/spark/pull/23169#issuecomment-446475026 @DaveDeCaprio I'm not sure only adding doc in SQLConf is enough for end users to be aware when troubleshooting their issues, but let's wait on committers feedback. @vanzin You might be interested on reviewing this, given that you've left comments in this issue and relevant issues for Apache JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize
AmplabJenkins removed a comment on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize URL: https://github.com/apache/spark/pull/23296#issuecomment-446472926 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize
AmplabJenkins commented on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize URL: https://github.com/apache/spark/pull/23296#issuecomment-446473122 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize
AmplabJenkins removed a comment on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize URL: https://github.com/apache/spark/pull/23296#issuecomment-446472860 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize
AmplabJenkins commented on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize URL: https://github.com/apache/spark/pull/23296#issuecomment-446472926 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize
AmplabJenkins commented on issue #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize URL: https://github.com/apache/spark/pull/23296#issuecomment-446472860 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] lcqzte10192193 opened a new pull request #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize
lcqzte10192193 opened a new pull request #23296: [mllib]MergeAggregate serialize and deserialize function use Bytebuffer to optimize URL: https://github.com/apache/spark/pull/23296 ## What changes were proposed in this pull request? MergeAggregate serialize and deserialize function can use ByteBuffer to optimize. ## How was this patch tested? SummarizerSuite This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446470421 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/8/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446470417 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446470417 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
AmplabJenkins commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446470421 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/8/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
SparkQA removed a comment on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446429391 **[Test build #8 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/8/testReport)** for PR 19045 at commit [`43abf98`](https://github.com/apache/spark/commit/43abf98dc1eff6bcc3d5c9ba8700a74c7922de2c). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown
SparkQA commented on issue #19045: [WIP][SPARK-20628][CORE][K8S] Keep track of nodes (/ spot instances) which are going to be shutdown URL: https://github.com/apache/spark/pull/19045#issuecomment-446469994 **[Test build #8 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/8/testReport)** for PR 19045 at commit [`43abf98`](https://github.com/apache/spark/commit/43abf98dc1eff6bcc3d5c9ba8700a74c7922de2c). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23273: [MINOR][DOC] Fix comments of ConvertToLocalRelation rule
SparkQA commented on issue #23273: [MINOR][DOC] Fix comments of ConvertToLocalRelation rule URL: https://github.com/apache/spark/pull/23273#issuecomment-446468234 **[Test build #16 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16/testReport)** for PR 23273 at commit [`1cca847`](https://github.com/apache/spark/commit/1cca8475e999139ab8f22f5cb583d9edce3cf550). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] benmccann commented on issue #20974: [SPARK-23862][SQL] Spark ExpressionEncoder should support java enum type in scala
benmccann commented on issue #20974: [SPARK-23862][SQL] Spark ExpressionEncoder should support java enum type in scala URL: https://github.com/apache/spark/pull/20974#issuecomment-446467737 @gatorsmile @cloud-fan would you be able to give this PR a look or suggest a more appropriate reviewer? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446467464 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins removed a comment on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446467466 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6000/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446467466 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/6000/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
AmplabJenkins removed a comment on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446467310 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/9/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
AmplabJenkins commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446467464 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning
SparkQA commented on issue #23249: [SPARK-26297][SQL] improve the doc of Distribution/Partitioning URL: https://github.com/apache/spark/pull/23249#issuecomment-446467343 **[Test build #15 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15/testReport)** for PR 23249 at commit [`cb94add`](https://github.com/apache/spark/commit/cb94addf126f9885c73c1dc8ead26fbcfece4441). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
AmplabJenkins commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446467310 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/9/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
AmplabJenkins removed a comment on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446467307 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
AmplabJenkins commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446467307 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
SparkQA commented on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446467127 **[Test build #9 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/9/testReport)** for PR 23276 at commit [`2f77b5d`](https://github.com/apache/spark/commit/2f77b5de4b8576d3b95ff6105f3587b692b25351). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA removed a comment on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way
SparkQA removed a comment on issue #23276: [SPARK-26321][SQL] Split a SQL in correct way URL: https://github.com/apache/spark/pull/23276#issuecomment-446433577 **[Test build #9 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/9/testReport)** for PR 23276 at commit [`2f77b5d`](https://github.com/apache/spark/commit/2f77b5de4b8576d3b95ff6105f3587b692b25351). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] seancxmao commented on a change in pull request #23273: [MINOR][DOC] Fix comments of ConvertToLocalRelation rule
seancxmao commented on a change in pull request #23273: [MINOR][DOC] Fix comments of ConvertToLocalRelation rule URL: https://github.com/apache/spark/pull/23273#discussion_r240886207 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1370,10 +1370,10 @@ object DecimalAggregates extends Rule[LogicalPlan] { } /** - * Converts local operations (i.e. ones that don't require data exchange) on LocalRelation to - * another LocalRelation. + * Converts local operations (i.e. ones that don't require data exchange) on [[LocalRelation]] to + * another [[LocalRelation]]. * - * This is relatively simple as it currently handles only 2 single case: Project and Limit. + * This rule currently handles 3 cases: [[Project]], [[Limit]] and [[Filter]]. Review comment: @srowen Sorry, I found that the links just do not work for scaladoc, though they works in IDE like Intellij IDEA. I should have generated the docs to see if there's any problem. I have changed `[[...]]` to backticks in a new commit. Also, I have some findings: * Because `org/apache/spark/sql/catalyst` is excluded for doc generation, I include catalyst in `SparkBuild.scala` and run `build/sbt unidoc`, there are many errors/warnings, should we fix these problems? Seems simple but many places. * As for `[[...]]`, fully qualified class name should be used if a link points to a class in another package. If the link points to a class in the same package, package prefix could be omitted. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] cloud-fan commented on issue #23294: [SPARK-26265][Core][Followup] Put freePage into a finally block
cloud-fan commented on issue #23294: [SPARK-26265][Core][Followup] Put freePage into a finally block URL: https://github.com/apache/spark/pull/23294#issuecomment-446467008 I think this can be merged to 2.4 without conflict. I'll ping you if it doesn't. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] rvesse commented on a change in pull request #22904: [SPARK-25887][K8S] Configurable K8S context support
rvesse commented on a change in pull request #22904: [SPARK-25887][K8S] Configurable K8S context support URL: https://github.com/apache/spark/pull/22904#discussion_r240886000 ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala ## @@ -67,8 +66,16 @@ private[spark] object SparkKubernetesClientFactory { val dispatcher = new Dispatcher( ThreadUtils.newDaemonCachedThreadPool("kubernetes-dispatcher")) -// TODO [SPARK-25887] Create builder in a way that respects configurable context -val config = new ConfigBuilder() +// Allow for specifying a context used to auto-configure from the users K8S config file +val kubeContext = sparkConf.get(KUBERNETES_CONTEXT).filter(c => StringUtils.isNotBlank(c)) +logInfo(s"Auto-configuring K8S client using " + + s"${if (kubeContext.isEmpty) s"context ${kubeContext.get}" else "current context"}" + + s" from users K8S config file") + +// Start from an auto-configured config with the desired context +// Fabric 8 uses null to indicate that the users current context should be used so if no +// explicit setting pass null +val config = new ConfigBuilder(autoConfigure(kubeContext.getOrElse(null))) Review comment: @aditanase One practical use case is interactive Spark code where the Spark REPL (and driver) is running at some login shell to which the user has access to e.g. on a physical edge node of the cluster and the actual Spark executors are running as K8S pods. This is something we have customers using today. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page URL: https://github.com/apache/spark/pull/23068#issuecomment-446463272 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5999/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page URL: https://github.com/apache/spark/pull/23068#issuecomment-446463271 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page URL: https://github.com/apache/spark/pull/23068#issuecomment-446463271 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page URL: https://github.com/apache/spark/pull/23068#issuecomment-446463272 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5999/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
SparkQA commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page URL: https://github.com/apache/spark/pull/23068#issuecomment-446463245 **[Test build #14 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14/testReport)** for PR 23068 at commit [`0a63604`](https://github.com/apache/spark/commit/0a636049ecc721cdd31cd676fce79aeb6582dd7c). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] gatorsmile commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
gatorsmile commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page URL: https://github.com/apache/spark/pull/23068#issuecomment-446462519 LGTM pending Jenkins. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] gatorsmile commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
gatorsmile commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page URL: https://github.com/apache/spark/pull/23068#issuecomment-446462497 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax
AmplabJenkins removed a comment on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax URL: https://github.com/apache/spark/pull/23295#issuecomment-446460866 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5998/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins removed a comment on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax
AmplabJenkins removed a comment on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax URL: https://github.com/apache/spark/pull/23295#issuecomment-446460865 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] SparkQA commented on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax
SparkQA commented on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax URL: https://github.com/apache/spark/pull/23295#issuecomment-446460982 **[Test build #13 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13/testReport)** for PR 23295 at commit [`361ed5c`](https://github.com/apache/spark/commit/361ed5c51f113a0cea6ec2d6d7dcfd9c877a4694). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] AmplabJenkins commented on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax
AmplabJenkins commented on issue #23295: [MINOR][SQL]Change `ThreadLocal.withInitial` to Scala lambda syntax URL: https://github.com/apache/spark/pull/23295#issuecomment-446460865 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org