[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user clockfly commented on a diff in the pull request: https://github.com/apache/spark/pull/13531#discussion_r66538048

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
@@ -340,6 +340,40 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
   }
 }
+  test("SPARK-15654 do not split non-splittable files") {
--- End diff --

Should we also test bin-packing of the gzipped files?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
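A test along the lines the reviewer suggests might look roughly like this. This is only a sketch: the `createTable` and `checkPartitionCount` helpers and the exact conf key semantics are assumptions modeled on the suite's style, not the suite's actual API.

```scala
// Sketch of the extra test being asked for: gzipped (non-splittable) files
// must each stay whole, but should still be bin-packed together into tasks.
// `createTable` and `checkPartitionCount` are hypothetical helpers.
test("SPARK-15654 bin-pack non-splittable gzipped files") {
  val table = createTable(files = Seq(
    "a.csv.gz" -> 10, // file name -> size in bytes
    "b.csv.gz" -> 10,
    "c.csv.gz" -> 10))
  withSQLConf("spark.sql.files.maxPartitionBytes" -> "25") {
    // Two gzipped files fit in one 25-byte bin and the third spills into a
    // second bin, but no single file is ever split across partitions.
    checkPartitionCount(table, expected = 2)
  }
}
```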
[GitHub] spark issue #10412: [SPARK-12447][YARN] Only update the states when executor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/10412 **[Test build #60250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60250/consoleFull)** for PR 10412 at commit [`fba6e5c`](https://github.com/apache/spark/commit/fba6e5c3cb2e03dc0a552cacecd6c986a10b75ef).
[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13371 Is this a bug fix or performance fix? Sorry, I don't really understand after reading your description.
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13581 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60245/
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13581 Merged build finished. Test PASSed.
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66537635

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala ---
+  def open(partitionId: Long, version: Long): Boolean
+
+  /**
+   * Called to process the data in the executor side.
--- End diff --

Also say: this method will be called only when `open` returns true.
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13581 **[Test build #60245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60245/consoleFull)** for PR 13581 at commit [`a086128`](https://github.com/apache/spark/commit/a0861285ccffe362a1a5758f9c99a36e6f7b2f5d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66537560

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala ---
+  /**
+   * Called when stopping to process one partition of new data in the executor side. This is
+   * guaranteed to be called when a `Throwable` is thrown during processing data. However,
+   * `close` won't be called in the following cases:
+   *  - JVM crashes without throwing a `Throwable`
+   *  - `open` throws a `Throwable`.
+   *
+   * @param errorOrNull the error thrown during processing data or null if nothing is thrown.
--- End diff --

"if nothing is thrown" --> "if there was no error".
[GitHub] spark issue #13582: [SPARK-15850][SQL] Remove function grouping in SparkSess...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13582 **[Test build #60249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60249/consoleFull)** for PR 13582 at commit [`a675b03`](https://github.com/apache/spark/commit/a675b03cb71097564320874d693e4ceaa323f1f1).
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66537102

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -401,6 +381,53 @@ final class DataFrameWriter private[sql](df: DataFrame) {
   }

   /**
+   * :: Experimental ::
+   * Starts the execution of the streaming query, which will continually send results to the given
+   * [[ForeachWriter]] as new data arrives. The returned [[ContinuousQuery]] object can be used to
+   * interact with the stream.
+   *
+   * @since 2.0.0
+   */
+  @Experimental
+  def foreach(writer: ForeachWriter[T]): ContinuousQuery = {
+    assertNotBucketed("foreach")
+    assertStreaming(
+      "foreach() on streaming Datasets and DataFrames can only be called on continuous queries")
+
+    val queryName = extraOptions.getOrElse("queryName", StreamExecution.nextName)
+    val sink = new ForeachSink[T](ds.sparkSession.sparkContext.clean(writer))(ds.exprEnc)
+    df.sparkSession.sessionState.continuousQueryManager.startQuery(
+      queryName,
+      getCheckpointLocation(queryName, required = false),
+      df,
+      sink,
+      outputMode,
+      trigger)
+  }
+
+  /**
+   * Returns the checkpointLocation for a query. If `required` is `true` but the checkpoint
+   * location is not set, [[AnalysisException]] will be thrown. If `required` is `false`, a temp
+   * folder will be created if the checkpoint location is not set.
+   */
+  private def getCheckpointLocation(queryName: String, required: Boolean): String = {
+    extraOptions.get("checkpointLocation").map { userSpecified =>
+      new Path(userSpecified).toUri.toString
+    }.orElse {
+      df.sparkSession.conf.get(SQLConf.CHECKPOINT_LOCATION).map { location =>
+        new Path(location, queryName).toUri.toString
+      }
+    }.getOrElse {
+      if (required) {
+        throw new AnalysisException("checkpointLocation must be specified either " +
+          "through option() or SQLConf")
--- End diff --

SQLConf is actually not a user-facing concept. Just say `SparkSession.conf.set`.
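From the user's side, the resolution chain quoted above amounts to two ways of supplying a checkpoint location, with the per-query option winning. A sketch, where the paths and the `dataset`/`myWriter` values are placeholders, and the conf key is assumed to be the one backing `SQLConf.CHECKPOINT_LOCATION`:

```scala
// Session-wide default: each query gets a subdirectory named after it.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

// Per-query option: takes precedence over the session-wide conf.
val query = dataset.write
  .option("checkpointLocation", "/tmp/checkpoints/my-query")
  .foreach(myWriter)
```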
[GitHub] spark issue #13577: [Minor][Doc] Improve SQLContext Documentation and Fix Sp...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13577 Actually, thanks for the PR. It got me to do this one: https://github.com/apache/spark/pull/13582
[GitHub] spark pull request #13582: [SPARK-15850][SQL] Remove function grouping in Sp...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/13582

[SPARK-15850][SQL] Remove function grouping in SparkSession

## What changes were proposed in this pull request?

SparkSession does not have that many functions due to better namespacing, and as a result we probably don't need the function grouping. This patch removes the grouping. Closes #13577.

## How was this patch tested?

N/A - this is a documentation change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-15850

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13582.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13582

commit a675b03cb71097564320874d693e4ceaa323f1f1
Author: Reynold Xin
Date: 2016-06-09T22:56:23Z

    [SPARK-15850][SQL] Remove function grouping in SparkSession
[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13371 Merged build finished. Test FAILed.
[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13371 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60246/
[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13371 **[Test build #60246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)** for PR 13371 at commit [`077f7f8`](https://github.com/apache/spark/commit/077f7f8813a76d38c8a6d898ec54e401c91b6014). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13393: [SPARK-14615][ML][FOLLOWUP] Fix Python examples to use t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13393 Hi @yanboliang, could you maybe take a quick look please?
[GitHub] spark issue #13572: [SPARK-15838] [SQL] CACHE TABLE AS SELECT should not rep...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13572 Are we breaking past behavior here? I would say we shouldn't change this, since this is a weird command anyway. I'm actually surprised we have this command.
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66534481

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala ---
+   * @param partitionId the partition id.
+   * @param version a unique id for data deduplication.
+   * @return a flat that indicates if the data should be processed.
--- End diff --

Return true if the corresponding partition and version id should be processed.
[GitHub] spark issue #13571: [SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13571 Merged build finished. Test FAILed.
[GitHub] spark issue #13571: [SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13571 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60244/
[GitHub] spark issue #13571: [SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13571 **[Test build #60244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60244/consoleFull)** for PR 13571 at commit [`1ca895a`](https://github.com/apache/spark/commit/1ca895a085c79734bd5a51031b60b2553717c256). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66533974

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
+   * Starts the execution of the streaming query, which will continually send results to the given
+   * [[ForeachWriter]] as new data arrives. The returned [[ContinuousQuery]] object can be used to
--- End diff --

Also, can you add a simple example showing the use of ForeachWriter:

```
datasetOfString.write.foreach(new ForeachWriter[String] {

  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
  }

  def process(record: String) = {
    // write string to connection
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
})
```
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66533645

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
+   * Starts the execution of the streaming query, which will continually send results to the given
+   * [[ForeachWriter]] as new data arrives. The returned [[ContinuousQuery]] object can be used to
--- End diff --

"...as new data arrives. The [[ForeachWriter]] can be used to send the data generated by the DataFrame/Dataset to an external system. The returned..."
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user clockfly commented on a diff in the pull request: https://github.com/apache/spark/pull/13531#discussion_r66533473

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -298,6 +309,28 @@ trait FileFormat {
 }

 /**
+ * The base class for file formats that are based on text files.
+ */
+abstract class TextBasedFileFormat extends FileFormat {
+  private var codecFactory: CompressionCodecFactory = null
+
+  override def isSplitable(
+      sparkSession: SparkSession,
+      options: Map[String, String],
+      path: Path): Boolean = {
+    if (codecFactory == null) {
+      synchronized {
--- End diff --

I am not sure whether we need "synchronized" here; do we want to ensure FileFormat is thread safe?
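If thread safety is the goal, an idiomatic alternative to the check-then-synchronize pattern quoted above is a `lazy val`, which the Scala compiler guards with a lock for once-only initialization. A sketch of that alternative (not the PR's code; the `buildConf` helper and the simplified `isSplitable` signature are assumptions for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

abstract class TextBasedFileFormatSketch {
  // Hypothetical hook for building the Hadoop configuration.
  protected def buildConf(): Configuration = new Configuration()

  // Initialized on first access; the compiler emits the synchronization.
  private lazy val codecFactory: CompressionCodecFactory =
    new CompressionCodecFactory(buildConf())

  def isSplitable(path: Path): Boolean = {
    val codec = codecFactory.getCodec(path)
    // A null codec means uncompressed text, which is splittable; a
    // SplittableCompressionCodec (e.g. bzip2) is also splittable.
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }
}
```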
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66533244

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
+  @Experimental
+  def foreach(writer: ForeachWriter[T]): ContinuousQuery = {
+    assertNotBucketed("foreach")
+    assertStreaming(
+      "foreach() on streaming Datasets and DataFrames can only be called on continuous queries")
--- End diff --

"foreach() can only be called on streaming Datasets/DataFrames."
[GitHub] spark issue #11746: [SPARK-13602][CORE] Add shutdown hook to DriverRunner to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11746 **[Test build #60248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60248/consoleFull)** for PR 11746 at commit [`0375da7`](https://github.com/apache/spark/commit/0375da7e2e4c7cfecd9b26c19401ebc31a5e6c69).
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66532833 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.streaming.ContinuousQuery + +/** + * :: Experimental :: + * A writer to consume data generated by a [[ContinuousQuery]]. Each partition will use a new --- End diff -- A writer to **write** data generated
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66532767 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.streaming.ContinuousQuery + +/** + * :: Experimental :: + * A writer to consume data generated by a [[ContinuousQuery]]. Each partition will use a new + * deserialized instance, so you usually should do the initialization work in the `open` method. --- End diff -- usually should --> should do all the initialization (e.g. opening a connection or initiating a transaction) in the `open` method
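A sketch of the lifecycle this review comment describes: each partition receives a fresh deserialized `ForeachWriter` instance, so a connection has to be opened in `open()` rather than in the constructor. `JdbcWriter`, the JDBC URL, and the table name are hypothetical, not from the PR.

```scala
// Illustrative ForeachWriter following the reviewer's guidance: do all
// initialization (e.g. opening a connection) in open(), release it in close().
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.ForeachWriter

class JdbcWriter(url: String) extends ForeachWriter[String] {
  @transient private var conn: Connection = _

  // Called once per partition; the instance is deserialized fresh, so the
  // connection must be created here, not in the constructor.
  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url)
    true // returning false would tell Spark to skip this partition
  }

  override def process(record: String): Unit = {
    val stmt = conn.prepareStatement("INSERT INTO lines(value) VALUES (?)")
    stmt.setString(1, record)
    stmt.executeUpdate()
    stmt.close()
  }

  // Called when the partition finishes, even on failure; clean up here.
  override def close(errorOrNull: Throwable): Unit = {
    if (conn != null) conn.close()
  }
}
```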
[GitHub] spark pull request #10412: [SPARK-12447][YARN] Only update the states when e...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/10412#discussion_r66532615 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -399,40 +402,61 @@ private[yarn] class YarnAllocator( */ private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = { for (container <- containersToUse) { - numExecutorsRunning += 1 - assert(numExecutorsRunning <= targetNumExecutors) + executorIdCounter += 1 val executorHostname = container.getNodeId.getHost val containerId = container.getId - executorIdCounter += 1 val executorId = executorIdCounter.toString - assert(container.getResource.getMemory >= resource.getMemory) - logInfo("Launching container %s for on host %s".format(containerId, executorHostname)) - executorIdToContainer(executorId) = container - containerIdToExecutorId(container.getId) = executorId - - val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, -new HashSet[ContainerId]) - - containerSet += containerId - allocatedContainerToHostMap.put(containerId, executorHostname) - - val executorRunnable = new ExecutorRunnable( -container, -conf, -sparkConf, -driverUrl, -executorId, -executorHostname, -executorMemory, -executorCores, -appAttemptId.getApplicationId.toString, -securityMgr) + + def updateInternalState(): Unit = synchronized { +numExecutorsRunning += 1 +assert(numExecutorsRunning <= targetNumExecutors) +executorIdToContainer(executorId) = container +containerIdToExecutorId(container.getId) = executorId + +val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, + new HashSet[ContainerId]) +containerSet += containerId +allocatedContainerToHostMap.put(containerId, executorHostname) + } + if (launchContainers) { logInfo("Launching ExecutorRunnable. 
driverUrl: %s, executorHostname: %s".format( driverUrl, executorHostname)) -launcherPool.execute(executorRunnable) + +val future = Future { --- End diff -- I see, I don't have a strong opinion on either, let me change it.
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66532405 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -401,6 +381,53 @@ final class DataFrameWriter private[sql](df: DataFrame) { } /** + * :: Experimental :: + * Starts the execution of the streaming query, which will continually send results to the given + * [[ForeachWriter]] as new data arrives. The returned [[ContinuousQuery]] object can be used to + * interact with the stream. + * + * @since 2.0.0 + */ + @Experimental + def foreach(writer: ForeachWriter[T]): ContinuousQuery = { +assertNotBucketed("foreach") +assertStreaming( + "foreach() on streaming Datasets and DataFrames can only be called on continuous queries") + +val queryName = extraOptions.getOrElse("queryName", StreamExecution.nextName) +val sink = new ForeachSink[T](ds.sparkSession.sparkContext.clean(writer))(ds.exprEnc) +df.sparkSession.sessionState.continuousQueryManager.startQuery( + queryName, + getCheckpointLocation(queryName, required = false), + df, + sink, + outputMode, + trigger) + } + + /** + * Returns the checkpointLocation for a query. If `required` is `true` but the checkpoint + * location is not set, [[AnalysisException]] will be thrown. If `required` is `false`, a temp + * folder will be created if the checkpoint location is not set. + */ + private def getCheckpointLocation(queryName: String, required: Boolean): String = { --- End diff -- The semantics of this method are very confusing. `required` implies that it will throw an error if there is no checkpoint location set. It's not intuitive that when it is not required it creates a temp checkpoint dir. Furthermore, it creates a temp one named on memory. Does not make sense. Cleaner for this to return an `Option[String]` and rename `required` --> `failIfNotSet`.
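One way to read the cleanup suggested above, sketched as code. This is the reviewer's proposal, not the merged implementation; `failIfNotSet` is the reviewer's proposed name, and the surrounding members (`extraOptions`, `df`, `SQLConf.CHECKPOINT_LOCATION`) are as shown in the diff.

```scala
// Hedged sketch of the suggested refactor: return Option[String] and let the
// caller decide whether to fall back to a temp directory when None.
private def getCheckpointLocation(queryName: String, failIfNotSet: Boolean): Option[String] = {
  val location = extraOptions.get("checkpointLocation").map { userSpecified =>
    new Path(userSpecified).toUri.toString
  }.orElse {
    df.sparkSession.conf.get(SQLConf.CHECKPOINT_LOCATION).map { l =>
      new Path(l, queryName).toUri.toString
    }
  }
  if (location.isEmpty && failIfNotSet) {
    throw new AnalysisException(
      "checkpointLocation must be specified either through " +
        """option("checkpointLocation", ...) or SQLConf""")
  }
  location // None means "not set"; the caller may create a temp dir explicitly
}
```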
[GitHub] spark pull request #10412: [SPARK-12447][YARN] Only update the states when e...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/10412#discussion_r66532257 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -399,40 +402,61 @@ private[yarn] class YarnAllocator( */ private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = { for (container <- containersToUse) { - numExecutorsRunning += 1 - assert(numExecutorsRunning <= targetNumExecutors) + executorIdCounter += 1 val executorHostname = container.getNodeId.getHost val containerId = container.getId - executorIdCounter += 1 val executorId = executorIdCounter.toString - assert(container.getResource.getMemory >= resource.getMemory) - logInfo("Launching container %s for on host %s".format(containerId, executorHostname)) - executorIdToContainer(executorId) = container - containerIdToExecutorId(container.getId) = executorId - - val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, -new HashSet[ContainerId]) - - containerSet += containerId - allocatedContainerToHostMap.put(containerId, executorHostname) - - val executorRunnable = new ExecutorRunnable( -container, -conf, -sparkConf, -driverUrl, -executorId, -executorHostname, -executorMemory, -executorCores, -appAttemptId.getApplicationId.toString, -securityMgr) + + def updateInternalState(): Unit = synchronized { +numExecutorsRunning += 1 +assert(numExecutorsRunning <= targetNumExecutors) +executorIdToContainer(executorId) = container +containerIdToExecutorId(container.getId) = executorId + +val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, + new HashSet[ContainerId]) +containerSet += containerId +allocatedContainerToHostMap.put(containerId, executorHostname) + } + if (launchContainers) { logInfo("Launching ExecutorRunnable. 
driverUrl: %s, executorHostname: %s".format( driverUrl, executorHostname)) -launcherPool.execute(executorRunnable) + +val future = Future { --- End diff -- It's not a potential problem, it just seems like less code to use the Java executor instead.
[GitHub] spark pull request #10412: [SPARK-12447][YARN] Only update the states when e...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/10412#discussion_r66532121 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -399,40 +402,61 @@ private[yarn] class YarnAllocator( */ private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = { for (container <- containersToUse) { - numExecutorsRunning += 1 - assert(numExecutorsRunning <= targetNumExecutors) + executorIdCounter += 1 val executorHostname = container.getNodeId.getHost val containerId = container.getId - executorIdCounter += 1 val executorId = executorIdCounter.toString - assert(container.getResource.getMemory >= resource.getMemory) - logInfo("Launching container %s for on host %s".format(containerId, executorHostname)) - executorIdToContainer(executorId) = container - containerIdToExecutorId(container.getId) = executorId - - val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, -new HashSet[ContainerId]) - - containerSet += containerId - allocatedContainerToHostMap.put(containerId, executorHostname) - - val executorRunnable = new ExecutorRunnable( -container, -conf, -sparkConf, -driverUrl, -executorId, -executorHostname, -executorMemory, -executorCores, -appAttemptId.getApplicationId.toString, -securityMgr) + + def updateInternalState(): Unit = synchronized { +numExecutorsRunning += 1 +assert(numExecutorsRunning <= targetNumExecutors) +executorIdToContainer(executorId) = container +containerIdToExecutorId(container.getId) = executorId + +val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, + new HashSet[ContainerId]) +containerSet += containerId +allocatedContainerToHostMap.put(containerId, executorHostname) + } + if (launchContainers) { logInfo("Launching ExecutorRunnable. 
driverUrl: %s, executorHostname: %s".format( driverUrl, executorHostname)) -launcherPool.execute(executorRunnable) + +val future = Future { --- End diff -- @vanzin , what is the potential problem of using Scala Future and implicit value?
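The two alternatives being debated in this thread can be sketched side by side. This is an illustration of the trade-off, not the PR's code: `launcherPool` mirrors the existing thread pool, and the bodies are placeholders.

```scala
// Sketch of the Future-vs-Java-executor discussion: a Scala Future needs an
// ExecutionContext wrapping the pool, while the Java executor is used directly.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

val launcherPool = Executors.newCachedThreadPool()

// Option A: Scala Future over an ExecutionContext derived from the pool.
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(launcherPool)
Future {
  // launch the executor container here
}.onComplete {
  case Success(_) => // updateInternalState(): bump counters, record the container
  case Failure(e) => // roll back counters and log the launch failure
}

// Option B (vanzin's point: less code): hand a Runnable to the Java pool.
launcherPool.execute(new Runnable {
  override def run(): Unit =
    try { /* launch the executor container; updateInternalState() */ }
    catch { case e: Exception => /* roll back counters and log the failure */ }
})
```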
[GitHub] spark pull request #13342: [SPARK-15593][SQL]Add DataFrameWriter.foreach to ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13342#discussion_r66532056 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -401,6 +381,53 @@ final class DataFrameWriter private[sql](df: DataFrame) { } /** + * :: Experimental :: + * Starts the execution of the streaming query, which will continually send results to the given + * [[ForeachWriter]] as new data arrives. The returned [[ContinuousQuery]] object can be used to + * interact with the stream. + * + * @since 2.0.0 + */ + @Experimental + def foreach(writer: ForeachWriter[T]): ContinuousQuery = { +assertNotBucketed("foreach") +assertStreaming( + "foreach() on streaming Datasets and DataFrames can only be called on continuous queries") + +val queryName = extraOptions.getOrElse("queryName", StreamExecution.nextName) +val sink = new ForeachSink[T](ds.sparkSession.sparkContext.clean(writer))(ds.exprEnc) +df.sparkSession.sessionState.continuousQueryManager.startQuery( + queryName, + getCheckpointLocation(queryName, required = false), + df, + sink, + outputMode, + trigger) + } + + /** + * Returns the checkpointLocation for a query. If `required` is `true` but the checkpoint + * location is not set, [[AnalysisException]] will be thrown. If `required` is `false`, a temp + * folder will be created if the checkpoint location is not set. 
+ */ + private def getCheckpointLocation(queryName: String, required: Boolean): String = { +extraOptions.get("checkpointLocation").map { userSpecified => + new Path(userSpecified).toUri.toString +}.orElse { + df.sparkSession.conf.get(SQLConf.CHECKPOINT_LOCATION).map { location => +new Path(location, queryName).toUri.toString + } +}.getOrElse { + if (required) { +throw new AnalysisException("checkpointLocation must be specified either " + + "through option() or SQLConf") --- End diff -- `option()` --> `option("checkpointLocation", ...)` `SQLConf` --> `sqlContext.conf..` (complete it) Makes it easier for the user
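Spelled out, the friendlier error message asked for above would point users at the two concrete settings the helper already resolves. The paths below are examples, and the string conf key is my reading of `SQLConf.CHECKPOINT_LOCATION`, so treat both as assumptions.

```scala
// Two ways a user can satisfy the checkpoint requirement.
// Per-query, via the writer option checked first by the helper:
val query = streamingDs.write
  .option("checkpointLocation", "/tmp/checkpoints/my-query")
  .foreach(writer)

// ...or globally, via the SQL conf the helper falls back to, which is then
// combined with the query name to build a per-query subdirectory:
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")
```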
[GitHub] spark issue #10412: [SPARK-12447][YARN] Only update the states when executor...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/10412 Sure, I will do it.
[GitHub] spark pull request #13577: [Minor][Doc] Improve SQLContext Documentation and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13577#discussion_r66529939 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -346,15 +346,68 @@ class SQLContext private[sql](val sparkSession: SparkSession) sparkSession.createDataFrame(rowRDD, schema, needsConversion) } - + /** + * Creates a [[Dataset]] from a local Seq of data of a given type. This method requires an + * encoder (to convert a JVM object of type `T` to and from the internal Spark SQL representation) + * that is generally created automatically through implicits from a `SparkSession`, or can be + * created explicitly by calling static methods on [[Encoders]]. + * + * == Example == --- End diff -- what does this look like when it is rendered?
[GitHub] spark issue #13549: [SPARK-15812][SQL][Streaming] Added support for sorting a...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/13549 @marmbrus could you take a look once again.
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13581 LGTM, pending jenkins
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13581 > in the PR description @hvanhovell not PR title...
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13581 **[Test build #60247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60247/consoleFull)** for PR 13581 at commit [`0ca9f99`](https://github.com/apache/spark/commit/0ca9f99e8336494e866708363b8ef415953ed339).
[GitHub] spark pull request #13581: [SPARK-14321][SQL] Reduce date format cost and st...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13581#discussion_r66527669 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -434,24 +436,21 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { left.dataType match { case StringType if right.foldable => val sdf = classOf[SimpleDateFormat].getName -val fString = if (constFormat == null) null else constFormat.toString -val formatter = ctx.freshName("formatter") -if (fString == null) { +if (formatter == null) { ev.copy(code = s""" boolean ${ev.isNull} = true; ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};""") } else { + val formatterName = ctx.addReferenceObj("formatter", formatter, sdf) --- End diff -- nit: maybe we can just use `formatter` as name
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13581 Tip: if you put "closes #13522" in the PR description then the subsumed PR will be automatically closed once this PR is merged.
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13581 LGTM except a minor comment
[GitHub] spark pull request #13581: [SPARK-14321][SQL] Reduce date format cost and st...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13581#discussion_r66525197 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -549,21 +552,22 @@ case class FromUnixTime(sec: Expression, format: Expression) override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val sdf = classOf[SimpleDateFormat].getName if (format.foldable) { - if (constFormat == null) { + if (formatter == null) { ev.copy(code = s""" --- End diff -- same here
[GitHub] spark pull request #13581: [SPARK-14321][SQL] Reduce date format cost and st...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13581#discussion_r66525010 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -434,24 +436,21 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { left.dataType match { case StringType if right.foldable => val sdf = classOf[SimpleDateFormat].getName -val fString = if (constFormat == null) null else constFormat.toString -val formatter = ctx.freshName("formatter") -if (fString == null) { +if (formatter == null) { ev.copy(code = s""" --- End diff -- this can be: `ev.copy(code = "", isNull = "false", value = ctx.defaultValue(dataType))`
[GitHub] spark issue #7786: [SPARK-9468][Yarn][Core] Avoid scheduling tasks on preemp...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/7786 @steveloughran > I suspect that if you get told you are being pre-empted, you aren't likely to get containers elsewhere That's very possible. But I'm just trying to point out that the current change doesn't really make things better. Without killing the executor, you'll still be holding on to resources, except now you wouldn't be using them. So might as well keep using them for as long as you can, or give them back as soon as possible (i.e. as soon as the executor becomes idle).
[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13371 **[Test build #60246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)** for PR 13371 at commit [`077f7f8`](https://github.com/apache/spark/commit/077f7f8813a76d38c8a6d898ec54e401c91b6014).
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13581 **[Test build #60245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60245/consoleFull)** for PR 13581 at commit [`a086128`](https://github.com/apache/spark/commit/a0861285ccffe362a1a5758f9c99a36e6f7b2f5d).
[GitHub] spark issue #13581: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/13581 cc @cloud-fan
[GitHub] spark pull request #13581: [SPARK-14321][SQL] Reduce date format cost and st...
GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/13581 [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions ## What changes were proposed in this pull request? The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolves that issue. This PR is a take-over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also improves exception handling (catching the actual exception instead of `Throwable`). All credits for this work should go to @rajeshbalamohan. ## How was this patch tested? Current tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hvanhovell/spark SPARK-14321 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13581.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13581 commit 602d4a70ba845df3160a07c2c9afe2d5c3c574c4 Author: Rajesh Balamohan Date: 2016-06-06T12:54:02Z [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions commit 425aa7ea30f1f92143fc9f539c7b928255dee1c9 Author: Rajesh Balamohan Date: 2016-06-07T08:17:20Z [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions commit a0861285ccffe362a1a5758f9c99a36e6f7b2f5d Author: Herman van Hovell Date: 2016-06-09T21:07:49Z Polishing code gen and exceptions
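The caching the PR describes can be sketched outside of Spark. This is a hedged illustration, not the actual `UnixTime`/`FromUnixTime` code: `SimpleDateFormat` is expensive to construct and not thread-safe, so one cached instance per thread avoids rebuilding it per row, and the parse failure caught is the concrete `ParseException` rather than `Throwable`. The object and method names here are invented for the example.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Hypothetical sketch of the optimization: SimpleDateFormat construction is
// expensive and the instance is not thread-safe, so each thread lazily caches
// one formatter instead of building a new one for every row.
object CachedDateFormat {
  private val cache = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat = {
      val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US)
      fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
      fmt
    }
  }

  // Parse a timestamp string to epoch seconds, returning None instead of
  // throwing; note the catch is the concrete ParseException, not Throwable.
  def toUnixTime(s: String): Option[Long] =
    try Some(cache.get().parse(s).getTime / 1000L)
    catch { case _: java.text.ParseException => None }
}
```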
[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 ping @yhuai @rxin @cloud-fan
[GitHub] spark issue #13580: Revert "[SPARK-14485][CORE] ignore task finished for exe...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/13580 I don't have strong feelings, but I still believe both cases hurt performance in different ways and in rare cases, and if that is going to happen I prefer the version that yells less at the user. In any case, if you feel strongly, no worries on my side.
[GitHub] spark issue #13571: [SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13571 **[Test build #60244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60244/consoleFull)** for PR 13571 at commit [`1ca895a`](https://github.com/apache/spark/commit/1ca895a085c79734bd5a51031b60b2553717c256).
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13549 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60243/ Test PASSed.
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13549 Merged build finished. Test PASSed.
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13549 **[Test build #60243 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60243/consoleFull)** for PR 13549 at commit [`810b802`](https://github.com/apache/spark/commit/810b802060792da78b61c44ad9b656ce3e02b4a3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13155: [SPARK-15370] [SQL] Update RewriteCorrelatedScala...
Github user frreiss commented on a diff in the pull request: https://github.com/apache/spark/pull/13155#discussion_r66509510 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1695,16 +1695,176 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] { } /** + * Statically evaluate an expression containing zero or more placeholders, given a set + * of bindings for placeholder values. + */ + private def evalExpr(expr : Expression, bindings : Map[Long, Option[Any]]) : Option[Any] = { +val rewrittenExpr = expr transform { + case r @ AttributeReference(_, dataType, _, _) => +bindings(r.exprId.id) match { + case Some(v) => Literal.create(v, dataType) + case None => Literal.default(NullType) +} +} +Option(rewrittenExpr.eval()) + } + + /** + * Statically evaluate an expression containing one or more aggregates on an empty input. + */ + private def evalAggOnZeroTups(expr : Expression) : Option[Any] = { +// AggregateExpressions are Unevaluable, so we need to replace all aggregates +// in the expression with the value they would return for zero input tuples. +val rewrittenExpr = expr transform { + case a @ AggregateExpression(aggFunc, _, _, resultId) => +aggFunc.defaultResult.getOrElse(Literal.default(NullType)) +} +Option(rewrittenExpr.eval()) + } + + /** + * Statically evaluate a scalar subquery on an empty input. + * + * WARNING: This method only covers subqueries that pass the checks under + * [[org.apache.spark.sql.catalyst.analysis.CheckAnalysis]]. If the checks in + * CheckAnalysis become less restrictive, this method will need to change. + */ + private def evalSubqueryOnZeroTups(plan: LogicalPlan) : Option[Any] = { +// Inputs to this method will start with a chain of zero or more SubqueryAlias +// and Project operators, followed by an optional Filter, followed by an +// Aggregate. Traverse the operators recursively. 
+def evalPlan(lp : LogicalPlan) : Map[Long, Option[Any]] = { + lp match { +case SubqueryAlias(_, child) => evalPlan(child) +case Filter(condition, child) => + val bindings = evalPlan(child) + if (bindings.size == 0) bindings + else { +val exprResult = evalExpr(condition, bindings).getOrElse(false) + .asInstanceOf[Boolean] +if (exprResult) bindings else Map() + } + +case Project(projectList, child) => + val bindings = evalPlan(child) + if (bindings.size == 0) { +bindings + } else { +projectList.map(ne => (ne.exprId.id, evalExpr(ne, bindings))).toMap + } + +case Aggregate(_, aggExprs, _) => + // Some of the expressions under the Aggregate node are the join columns + // for joining with the outer query block. Fill those expressions in with + // nulls and statically evaluate the remainder. + aggExprs.map(ne => ne match { +case AttributeReference(_, _, _, _) => (ne.exprId.id, None) +case Alias(AttributeReference(_, _, _, _), _) => (ne.exprId.id, None) +case _ => (ne.exprId.id, evalAggOnZeroTups(ne)) + }).toMap + +case _ => sys.error(s"Unexpected operator in scalar subquery: $lp") + } +} + +val resultMap = evalPlan(plan) + +// By convention, the scalar subquery result is the leftmost field. +resultMap(plan.output.head.exprId.id) + } + + /** + * Split the plan for a scalar subquery into the parts above the Aggregate node + * (first part of returned value) and the parts below the Aggregate node, including + * the Aggregate (second part of returned value) + */ + private def splitSubquery(plan : LogicalPlan) : Tuple2[Seq[LogicalPlan], Aggregate] = { +var topPart = List[LogicalPlan]() --- End diff -- Fixed in my local copy.
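The `evalAggOnZeroTups` idea quoted in the diff above (replacing each aggregate with the value it would return on zero input rows, mirroring `defaultResult`) can be illustrated with a toy model. This sketch uses invented types, not Catalyst's classes: the key fact is that COUNT of an empty input is 0 while SUM/MAX of an empty input are NULL (modeled as `None`).

```scala
// Toy model (not Catalyst) of evaluating aggregates on zero input tuples:
// each aggregate knows the value it yields when it sees no rows at all,
// the role played by AggregateFunction.defaultResult in the real code.
sealed trait Agg { def onZeroTuples: Option[Any] }
case object Count extends Agg { def onZeroTuples: Option[Any] = Some(0L) } // COUNT(*) of nothing is 0
case object Sum   extends Agg { def onZeroTuples: Option[Any] = None }     // SUM of nothing is NULL
case object Max   extends Agg { def onZeroTuples: Option[Any] = None }     // MAX of nothing is NULL

// A scalar subquery whose plan is an Aggregate over an empty relation can be
// folded to literals by asking each output aggregate for its zero-row value.
def evalOnEmptyInput(aggs: Seq[Agg]): Seq[Option[Any]] = aggs.map(_.onZeroTuples)
```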
[GitHub] spark pull request #13155: [SPARK-15370] [SQL] Update RewriteCorrelatedScala...
Github user frreiss commented on a diff in the pull request: https://github.com/apache/spark/pull/13155#discussion_r66509444 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1695,16 +1695,176 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] { } /** + * Statically evaluate an expression containing zero or more placeholders, given a set + * of bindings for placeholder values. + */ + private def evalExpr(expr : Expression, bindings : Map[Long, Option[Any]]) : Option[Any] = { +val rewrittenExpr = expr transform { + case r @ AttributeReference(_, dataType, _, _) => +bindings(r.exprId.id) match { + case Some(v) => Literal.create(v, dataType) + case None => Literal.default(NullType) +} +} +Option(rewrittenExpr.eval()) + } + + /** + * Statically evaluate an expression containing one or more aggregates on an empty input. + */ + private def evalAggOnZeroTups(expr : Expression) : Option[Any] = { +// AggregateExpressions are Unevaluable, so we need to replace all aggregates +// in the expression with the value they would return for zero input tuples. +val rewrittenExpr = expr transform { + case a @ AggregateExpression(aggFunc, _, _, resultId) => +aggFunc.defaultResult.getOrElse(Literal.default(NullType)) +} +Option(rewrittenExpr.eval()) + } + + /** + * Statically evaluate a scalar subquery on an empty input. + * + * WARNING: This method only covers subqueries that pass the checks under + * [[org.apache.spark.sql.catalyst.analysis.CheckAnalysis]]. If the checks in + * CheckAnalysis become less restrictive, this method will need to change. + */ + private def evalSubqueryOnZeroTups(plan: LogicalPlan) : Option[Any] = { +// Inputs to this method will start with a chain of zero or more SubqueryAlias +// and Project operators, followed by an optional Filter, followed by an +// Aggregate. Traverse the operators recursively. 
+def evalPlan(lp : LogicalPlan) : Map[Long, Option[Any]] = { + lp match { +case SubqueryAlias(_, child) => evalPlan(child) +case Filter(condition, child) => + val bindings = evalPlan(child) + if (bindings.size == 0) bindings + else { +val exprResult = evalExpr(condition, bindings).getOrElse(false) + .asInstanceOf[Boolean] +if (exprResult) bindings else Map() --- End diff -- Fixed in my local copy
[GitHub] spark pull request #13155: [SPARK-15370] [SQL] Update RewriteCorrelatedScala...
Github user frreiss commented on a diff in the pull request: https://github.com/apache/spark/pull/13155#discussion_r66509405 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1695,16 +1695,176 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] { } /** + * Statically evaluate an expression containing zero or more placeholders, given a set + * of bindings for placeholder values. + */ + private def evalExpr(expr : Expression, bindings : Map[Long, Option[Any]]) : Option[Any] = { +val rewrittenExpr = expr transform { + case r @ AttributeReference(_, dataType, _, _) => +bindings(r.exprId.id) match { + case Some(v) => Literal.create(v, dataType) + case None => Literal.default(NullType) +} +} +Option(rewrittenExpr.eval()) + } + + /** + * Statically evaluate an expression containing one or more aggregates on an empty input. + */ + private def evalAggOnZeroTups(expr : Expression) : Option[Any] = { +// AggregateExpressions are Unevaluable, so we need to replace all aggregates +// in the expression with the value they would return for zero input tuples. +val rewrittenExpr = expr transform { + case a @ AggregateExpression(aggFunc, _, _, resultId) => +aggFunc.defaultResult.getOrElse(Literal.default(NullType)) +} +Option(rewrittenExpr.eval()) + } + + /** + * Statically evaluate a scalar subquery on an empty input. + * + * WARNING: This method only covers subqueries that pass the checks under + * [[org.apache.spark.sql.catalyst.analysis.CheckAnalysis]]. If the checks in + * CheckAnalysis become less restrictive, this method will need to change. + */ + private def evalSubqueryOnZeroTups(plan: LogicalPlan) : Option[Any] = { +// Inputs to this method will start with a chain of zero or more SubqueryAlias +// and Project operators, followed by an optional Filter, followed by an +// Aggregate. Traverse the operators recursively. 
+def evalPlan(lp : LogicalPlan) : Map[Long, Option[Any]] = { + lp match { +case SubqueryAlias(_, child) => evalPlan(child) +case Filter(condition, child) => + val bindings = evalPlan(child) + if (bindings.size == 0) bindings --- End diff -- Fixed in my local copy
[GitHub] spark issue #13155: [SPARK-15370] [SQL] Update RewriteCorrelatedScalarSubque...
Github user frreiss commented on the issue: https://github.com/apache/spark/pull/13155 @rxin I'll have an updated set of changes in tonight
[GitHub] spark pull request #13573: [SPARK-15839] Fix Maven doc-jar generation when J...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13573
[GitHub] spark issue #13580: Revert "[SPARK-14485][CORE] ignore task finished for exe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13580 Merged build finished. Test PASSed.
[GitHub] spark issue #13580: Revert "[SPARK-14485][CORE] ignore task finished for exe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13580 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60241/ Test PASSed.
[GitHub] spark issue #13573: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_H...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13573 I am going to trigger a snapshot build.
[GitHub] spark issue #13580: Revert "[SPARK-14485][CORE] ignore task finished for exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13580 **[Test build #60241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60241/consoleFull)** for PR 13580 at commit [`912ce5a`](https://github.com/apache/spark/commit/912ce5a66fae9db2cf84956375b24de994568a9e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13573: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_H...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13573 Merging to master and branch 2.0.
[GitHub] spark issue #13573: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_H...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13573 lgtm
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13549 **[Test build #60243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60243/consoleFull)** for PR 13549 at commit [`810b802`](https://github.com/apache/spark/commit/810b802060792da78b61c44ad9b656ce3e02b4a3).
[GitHub] spark pull request #13543: [SPARK-15806] [Documentation] update doc for SPAR...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13543#discussion_r66501686 --- Diff: core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala --- @@ -20,18 +20,24 @@ package org.apache.spark.deploy.master import scala.annotation.tailrec import org.apache.spark.SparkConf +import org.apache.spark.internal.Logging import org.apache.spark.util.{IntParam, Utils} /** * Command-line parser for the master. */ -private[master] class MasterArguments(args: Array[String], conf: SparkConf) { +private[master] class MasterArguments(args: Array[String], conf: SparkConf) extends Logging { var host = Utils.localHostName() var port = 7077 var webUiPort = 8080 var propertiesFile: String = null // Check for settings in environment variables + if (System.getenv("SPARK_MASTER_IP") != null) { +logWarning("SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST") +host = System.getenv("SPARK_MASTER_IP") + } + if (System.getenv("SPARK_MASTER_HOST") != null) { --- End diff -- The script sets --host to the value of SPARK_MASTER_HOST though, and I think that's considered the only valid way to start the master. It is a little behavior change but I was wondering if it's best to go ahead and move any unofficial use case away from env variables anyway.
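The precedence being discussed in the diff above (honor the deprecated `SPARK_MASTER_IP` with a warning, but let `SPARK_MASTER_HOST` win when both are set) can be modeled as a small pure function. This is an illustrative sketch with invented names, not the actual `MasterArguments` code.

```scala
// Illustrative model of the env-var precedence under review (not the real
// MasterArguments): SPARK_MASTER_IP is deprecated but still honored, and a
// warning is emitted whenever it is set; SPARK_MASTER_HOST overrides it.
def resolveMasterHost(env: Map[String, String],
                      default: String,
                      warn: String => Unit): String = {
  val deprecated = env.get("SPARK_MASTER_IP")
  deprecated.foreach(_ =>
    warn("SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST"))
  // The new variable wins over the deprecated one; fall back to the default
  // (Utils.localHostName() in the real code) when neither is set.
  env.get("SPARK_MASTER_HOST").orElse(deprecated).getOrElse(default)
}
```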
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13549 test this please
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/13436 Hi, @rxin. Could you review this PR and give your opinion when you have some time?
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13549 **[Test build #60242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60242/consoleFull)** for PR 13549 at commit [`810b802`](https://github.com/apache/spark/commit/810b802060792da78b61c44ad9b656ce3e02b4a3).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13549 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60242/ Test FAILed.
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13549 Merged build finished. Test FAILed.
[GitHub] spark issue #13579: [SPARK-15844] [core] HistoryServer doesn't come up if sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13579 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60238/ Test PASSed.
[GitHub] spark issue #13569: [SPARK-15791] Fix NPE in ScalarSubquery
Github user ericl commented on the issue: https://github.com/apache/spark/pull/13569 Yeah, I'm looking into how to reproduce this in a unit test. It seems we have to go through `DeserializeToObjectExec` to reproduce the NPE:
```
Execution 'q9-v1.4' failed: Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): java.lang.NullPointerException
  at org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
  at org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
  at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
  at org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
  at org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291)
  at org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85)
  at org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```
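The trace above shows `ScalarSubquery.dataType` throwing before execution. A hedged toy model of that failure mode, using invented class names rather than the real `ScalarSubquery`: if `dataType` is derived from a result field that is only populated when the subquery runs, calling it at planning time dereferences null; deriving the type from static plan information avoids the dependency on the runtime value.

```scala
// Minimal model of the failure mode (invented classes, not Spark's):
// dataType is computed from a lazily-populated result, so asking for it
// before the subquery has executed dereferences null.
class BrokenScalarSubquery {
  private var result: java.lang.Long = null            // filled in at execution time
  def dataType: String = result.getClass.getSimpleName // NPE before execution
}

// One possible fix direction: carry the type as static plan information so
// dataType never touches the runtime value.
class FixedScalarSubquery(planOutputType: String) {
  private var result: java.lang.Long = null            // still filled in later
  def dataType: String = planOutputType                // safe at planning time
}
```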
[GitHub] spark issue #13579: [SPARK-15844] [core] HistoryServer doesn't come up if sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13579 Merged build finished. Test PASSed.
[GitHub] spark issue #13579: [SPARK-15844] [core] HistoryServer doesn't come up if sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13579 **[Test build #60238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60238/consoleFull)** for PR 13579 at commit [`26d1ad2`](https://github.com/apache/spark/commit/26d1ad243c9c6db227a13303c3781546634794db).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13573: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13573 Merged build finished. Test PASSed.
[GitHub] spark issue #13573: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60239/ Test PASSed.
[GitHub] spark issue #13573: [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13573

**[Test build #60239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60239/consoleFull)** for PR 13573 at commit [`523487b`](https://github.com/apache/spark/commit/523487bb0be24c698ee66d8c09430d1909eff81c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13543: [SPARK-15806] [Documentation] update doc for SPAR...
Github user bomeng commented on a diff in the pull request: https://github.com/apache/spark/pull/13543#discussion_r66493098

```diff
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala ---
@@ -20,18 +20,24 @@ package org.apache.spark.deploy.master
 
 import scala.annotation.tailrec
 
 import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
 import org.apache.spark.util.{IntParam, Utils}
 
 /**
  * Command-line parser for the master.
  */
-private[master] class MasterArguments(args: Array[String], conf: SparkConf) {
+private[master] class MasterArguments(args: Array[String], conf: SparkConf) extends Logging {
   var host = Utils.localHostName()
   var port = 7077
   var webUiPort = 8080
   var propertiesFile: String = null
 
   // Check for settings in environment variables
+  if (System.getenv("SPARK_MASTER_IP") != null) {
+    logWarning("SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST")
+    host = System.getenv("SPARK_MASTER_IP")
+  }
+  if (System.getenv("SPARK_MASTER_HOST") != null) {
```

--- End diff -- MasterArguments.scala is used by Master.scala's main() method, so there is a way to use `SPARK_MASTER_HOST`.
[GitHub] spark issue #13569: [SPARK-15791] Fix NPE in ScalarSubquery
Github user davies commented on the issue: https://github.com/apache/spark/pull/13569 LGTM
[GitHub] spark issue #13400: [SPARK-15655] [SQL] Fix Wrong Partition Column Order whe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13400 Merged build finished. Test PASSed.
[GitHub] spark issue #13400: [SPARK-15655] [SQL] Fix Wrong Partition Column Order whe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13400 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60240/ Test PASSed.
[GitHub] spark issue #13400: [SPARK-15655] [SQL] Fix Wrong Partition Column Order whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13400

**[Test build #60240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60240/consoleFull)** for PR 13400 at commit [`5bc8996`](https://github.com/apache/spark/commit/5bc89966765e1ec37b7c8d167ac6156988a9a720).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13564: [SPARK-15827][BUILD] Publish Spark's forked sbt-p...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13564
[GitHub] spark pull request #13549: [SPARK-15812][SQ][Streaming] Added support for so...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13549#discussion_r66491058

```diff
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala ---
@@ -43,6 +43,41 @@ object UnsupportedOperationChecker {
         "Queries without streaming sources cannot be executed with write.startStream()")(plan)
     }
 
+    // Disallow multiple streaming aggregations
+    val aggregates = plan.collect { case a @ Aggregate(_, _, _) if a.isStreaming => a }
+
+    if (aggregates.size > 1) {
+      throwError(
+        "Multiple streaming aggregations are not supported with " +
+          "streaming DataFrames/Datasets")(plan)
+    }
+
+    // Disallow some output modes
+    outputMode match {
+      case InternalOutputModes.Append if aggregates.nonEmpty =>
+        throwError(
+          s"$outputMode output mode not supported when there are streaming aggregations on " +
+            s"streaming DataFrames/DataSets")(plan)
+
+      case InternalOutputModes.Complete | InternalOutputModes.Update if aggregates.isEmpty =>
+        throwError(
+          s"$outputMode output mode not supported when there are no streaming aggregations on " +
+            s"streaming DataFrames/Datasets")(plan)
+
+      case _ =>
+    }
+
+    /**
+     * Whether the subplan will contain complete data or incremental data in every incremental
+     * execution. Some operations may be allowed only when the child logical plan gives complete
+     * data.
+     */
+    def containsCompleteData(subplan: LogicalPlan): Boolean = {
+      val aggs = plan.collect { case a @ Aggregate(_, _, _) if a.isStreaming => a }
```

--- End diff -- I completely agree. This whole file needs to be restructured in future. This is really a 2.0 interim fix.
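The validation rule quoted in the diff above can be illustrated with a small standalone sketch. The plan node, output modes, and helper names below are simplified stand-ins, not Spark's real catalyst types: a plan is rejected when it has more than one streaming aggregation, when Append mode is combined with an aggregation, or when Complete/Update mode is used without one.

```scala
// Simplified stand-ins for Spark's OutputMode values (illustrative only).
sealed trait OutputMode
case object Append extends OutputMode
case object Complete extends OutputMode
case object Update extends OutputMode

// A toy logical plan: each node only records whether it is a streaming aggregate.
case class PlanNode(isStreamingAggregate: Boolean, children: Seq[PlanNode] = Nil) {
  def collectAggregates: Seq[PlanNode] =
    (if (isStreamingAggregate) Seq(this) else Nil) ++ children.flatMap(_.collectAggregates)
}

// Mirrors the structure of the check in the diff: Left(error) plays the role of throwError.
def checkOutputMode(plan: PlanNode, mode: OutputMode): Either[String, Unit] = {
  val aggregates = plan.collectAggregates
  if (aggregates.size > 1)
    Left("Multiple streaming aggregations are not supported")
  else mode match {
    case Append if aggregates.nonEmpty =>
      Left("Append output mode not supported when there are streaming aggregations")
    case Complete | Update if aggregates.isEmpty =>
      Left(s"$mode output mode not supported when there are no streaming aggregations")
    case _ => Right(())
  }
}
```

For example, `checkOutputMode(PlanNode(isStreamingAggregate = true), Append)` yields a `Left`, while the same plan under `Complete` passes.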
[GitHub] spark issue #13549: [SPARK-15812][SQ][Streaming] Added support for sorting a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13549 **[Test build #60242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60242/consoleFull)** for PR 13549 at commit [`810b802`](https://github.com/apache/spark/commit/810b802060792da78b61c44ad9b656ce3e02b4a3).
[GitHub] spark issue #13564: [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-read...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13564 This is now sync'd to Maven Central and has passed tests, so I'm going to merge it into master, branch-2.0, and branch-1.6.
[GitHub] spark pull request #13543: [SPARK-15806] [Documentation] update doc for SPAR...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13543#discussion_r66490508

```diff
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala ---
@@ -20,18 +20,24 @@ package org.apache.spark.deploy.master
 
 import scala.annotation.tailrec
 
 import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
 import org.apache.spark.util.{IntParam, Utils}
 
 /**
  * Command-line parser for the master.
  */
-private[master] class MasterArguments(args: Array[String], conf: SparkConf) {
+private[master] class MasterArguments(args: Array[String], conf: SparkConf) extends Logging {
   var host = Utils.localHostName()
   var port = 7077
   var webUiPort = 8080
   var propertiesFile: String = null
 
   // Check for settings in environment variables
+  if (System.getenv("SPARK_MASTER_IP") != null) {
+    logWarning("SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST")
+    host = System.getenv("SPARK_MASTER_IP")
+  }
+  if (System.getenv("SPARK_MASTER_HOST") != null) {
```

--- End diff -- Yeah, the env param already controls the value of `--host` from the script, so I think this is redundant. Right? Although I don't feel strongly about the second point, I thought it might be tidier to handle the env param in one place rather than two. Env variables do feel like something more from bash-land than Scala-land.
[GitHub] spark issue #13481: [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/13481 Can this be merged @MLnick ?
[GitHub] spark pull request #13543: [SPARK-15806] [Documentation] update doc for SPAR...
Github user bomeng commented on a diff in the pull request: https://github.com/apache/spark/pull/13543#discussion_r66488409

```diff
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala ---
@@ -20,18 +20,24 @@ package org.apache.spark.deploy.master
 
 import scala.annotation.tailrec
 
 import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
 import org.apache.spark.util.{IntParam, Utils}
 
 /**
  * Command-line parser for the master.
  */
-private[master] class MasterArguments(args: Array[String], conf: SparkConf) {
+private[master] class MasterArguments(args: Array[String], conf: SparkConf) extends Logging {
   var host = Utils.localHostName()
   var port = 7077
   var webUiPort = 8080
   var propertiesFile: String = null
 
   // Check for settings in environment variables
+  if (System.getenv("SPARK_MASTER_IP") != null) {
+    logWarning("SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST")
+    host = System.getenv("SPARK_MASTER_IP")
+  }
+  if (System.getenv("SPARK_MASTER_HOST") != null) {
```

--- End diff -- The code here is just to set its initial values and it may be changed by the `--host` configuration. I think we should keep it there for now. For the warning message, we kind of always use the logger; not sure it is a good idea to put it into the script. I am open to your decision.
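The pattern being debated in this thread (prefer a new environment variable, fall back to a deprecated one with a warning) can be sketched in isolation. The function name and the use of `Console.err` instead of Spark's `logWarning` are illustrative assumptions, not the PR's actual code:

```scala
// Hypothetical sketch of deprecated-env-var fallback, as discussed above.
// Prefers SPARK_MASTER_HOST; falls back to SPARK_MASTER_IP with a warning.
def resolveHost(env: Map[String, String], default: String): String =
  env.get("SPARK_MASTER_HOST").orElse {
    env.get("SPARK_MASTER_IP").map { ip =>
      // Stand-in for logWarning in the real MasterArguments class.
      Console.err.println("SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST")
      ip
    }
  }.getOrElse(default)
```

Note this resolves the value in one place, which is the "tidier" single-location handling srowen suggests; the PR under review instead checks each variable separately so that `SPARK_MASTER_HOST` wins when both are set.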
[GitHub] spark issue #13580: Revert "[SPARK-14485][CORE] ignore task finished for exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13580 **[Test build #60241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60241/consoleFull)** for PR 13580 at commit [`912ce5a`](https://github.com/apache/spark/commit/912ce5a66fae9db2cf84956375b24de994568a9e).
[GitHub] spark pull request #13580: Revert "[SPARK-14485][CORE] ignore task finished ...
GitHub user kayousterhout opened a pull request: https://github.com/apache/spark/pull/13580 Revert "[SPARK-14485][CORE] ignore task finished for executor lost an…

## What changes were proposed in this pull request?

This reverts commit 695dbc816a6d70289abeb145cb62ff4e62b3f49b. This change is being reverted because it hurts performance of some jobs, and only helps in a narrow set of cases. For more discussion, refer to the JIRA.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kayousterhout/spark-1 revert-SPARK-14485

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13580.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13580

commit 912ce5a66fae9db2cf84956375b24de994568a9e
Author: Kay Ousterhout
Date: 2016-06-09T17:07:32Z

Revert "[SPARK-14485][CORE] ignore task finished for executor lost and removed by driver"

This reverts commit 695dbc816a6d70289abeb145cb62ff4e62b3f49b. This change is being reverted because it hurts performance of some jobs, and only helps in a narrow set of cases. For more discussion, refer to the JIRA.
[GitHub] spark issue #13572: [SPARK-15838] [SQL] CACHE TABLE AS SELECT should not rep...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/13572 LGTM @liancheng anything to add?
[GitHub] spark issue #13477: [SPARK-15739][GraphX] Expose aggregateMessagesWithActive...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13477 Can one of the admins verify this patch?
[GitHub] spark pull request #13540: [SPARK-15788][PYSPARK][ML] PySpark IDFModel missi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13540
[GitHub] spark issue #13400: [SPARK-15655] [SQL] Fix Wrong Partition Column Order whe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13400 **[Test build #60240 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60240/consoleFull)** for PR 13400 at commit [`5bc8996`](https://github.com/apache/spark/commit/5bc89966765e1ec37b7c8d167ac6156988a9a720).
[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13555
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/13531#discussion_r66478617

```diff
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -298,6 +309,28 @@ trait FileFormat {
 }
 
 /**
+ * The base class file format that is based on text file.
+ */
+abstract class TextBasedFileFormat extends FileFormat {
+  private var codecFactory: CompressionCodecFactory = null
+
+  override def isSplitable(
+      sparkSession: SparkSession,
+      options: Map[String, String],
+      path: Path): Boolean = {
+    if (codecFactory == null) {
+      synchronized {
+        if (codecFactory == null) {
+          codecFactory = new CompressionCodecFactory(
```

--- End diff -- There could be "io.compression.codecs" in options, it's used by getCodec(). Since all other APIs have that, it's better to have that here too.
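The diff above hand-rolls double-checked locking to initialize `codecFactory` once. As a point of comparison, Scala's `lazy val` provides the same thread-safe once-only initialization without the manual null checks. The sketch below uses a stand-in class in place of Hadoop's `CompressionCodecFactory`, and the `.gz`/`.bz2` suffix check is a simplification of real codec lookup, not Spark's actual logic:

```scala
// Stand-in for Hadoop's CompressionCodecFactory (illustrative only).
class CodecLookup(options: Map[String, String]) {
  // Real codec detection consults registered codecs; a suffix check suffices here.
  def findCodec(path: String): Option[String] =
    Seq(".gz", ".snappy", ".lz4").find(path.endsWith)
}

abstract class TextBasedFormatSketch(options: Map[String, String]) {
  // Initialized at most once, on first access, under the JVM's built-in
  // monitor -- equivalent to the diff's double-checked locking on codecFactory.
  private lazy val codecLookup = new CodecLookup(options)

  // A file is splittable only if no (non-splittable) compression codec matches.
  def isSplitable(path: String): Boolean = codecLookup.findCodec(path).isEmpty
}
```

This also shows where davies' point fits: the `options` map is threaded into the factory's constructor, so a user-supplied `"io.compression.codecs"` entry could influence codec lookup, as suggested in the comment.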