[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14100 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14100 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61945/ Test PASSed.
[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14100 **[Test build #61945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61945/consoleFull)** for PR 14100 at commit [`e00bc53`](https://github.com/apache/spark/commit/e00bc53cc9835104045a5c451a058c77d84f382c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile These are the benchmark results. No significant difference.

Before this patch:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader:           Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
reading Parquet file           855 / 1162           2.4          417.2        1.0X

After this patch:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader:           Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
reading Parquet file           874 / 1228           2.3          426.7        1.0X
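As a sanity check on the two runs above, the Rate(M/s) column is just the reciprocal of Per Row(ns). A small Python sketch (the helper name `rate_m_per_s` is illustrative, not part of Spark's benchmark harness) reproduces the reported rates and quantifies the gap:

```python
# Hypothetical helper mirroring how a rows-per-second rate is derived
# from a nanoseconds-per-row measurement: rate = 1e9 / ns_per_row / 1e6.
def rate_m_per_s(ns_per_row: float) -> float:
    """Convert nanoseconds per row into millions of rows per second."""
    return 1e9 / ns_per_row / 1e6

# The two runs above: 417.2 ns/row before the patch, 426.7 ns/row after.
before = rate_m_per_s(417.2)  # matches the reported ~2.4 M rows/s
after = rate_m_per_s(426.7)   # matches the reported ~2.3 M rows/s

# Relative per-row slowdown is roughly 2%, consistent with the
# "no significant difference" conclusion.
slowdown = 426.7 / 417.2 - 1
```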
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user janplus commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r70018330

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }

+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
+      'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
+      'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
+      '1'""")

--- End diff --

Compilation fails with the error `Error:(686, 9) annotation argument needs to be a constant; found: scala.this.Predef.augmentString`, so I should probably retain the current extended description string.
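The semantics under discussion can be sketched in plain Python with the standard library, including the key-extraction regex built from the `REGEXPREFIX`/`REGEXSUBFIX` constants in the diff ("(&|^)key=([^&]*)"). The function name `parse_url_py` is illustrative, not a Spark API:

```python
import re
from urllib.parse import urlparse

def parse_url_py(url: str, part: str, key: str = None):
    """Sketch of parse_url(url, partToExtract[, key]) for a few parts.

    Mirrors the examples in the extended description above; only HOST,
    PATH, and QUERY are handled in this sketch.
    """
    parsed = urlparse(url)
    if part == "HOST":
        return parsed.hostname
    if part == "PATH":
        return parsed.path
    if part == "QUERY":
        if key is None:
            return parsed.query or None
        # Same pattern as REGEXPREFIX + key + REGEXSUBFIX in the Scala code.
        m = re.search("(&|^)" + re.escape(key) + "=([^&]*)", parsed.query)
        return m.group(2) if m else None
    return None

# Matches the first example in the extended description.
host = parse_url_py("http://spark.apache.org/path?query=1", "HOST")
```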
[GitHub] spark issue #13890: [SPARK-16189][SQL] Add ExternalRDD logical plan for inpu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13890 **[Test build #61950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61950/consoleFull)** for PR 13890 at commit [`e218f5f`](https://github.com/apache/spark/commit/e218f5fbef996ca3e9606cc68bf433c83ebf224e).
[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14099 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61944/ Test FAILed.
[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14099 Merged build finished. Test FAILed.
[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14099 **[Test build #61944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61944/consoleFull)** for PR 14099 at commit [`9ce8146`](https://github.com/apache/spark/commit/9ce81468b41c7515c6ce3fd4791e0ecc03a7de19). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14071#discussion_r70017144

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
@@ -162,25 +147,28 @@ private[hive] case class MetastoreRelation(
       tPartition.setTableName(tableName)
       tPartition.setValues(partitionKeys.map(a => p.spec(a.name)).asJava)

-      val sd = new org.apache.hadoop.hive.metastore.api.StorageDescriptor()
+      val sd = toHiveStorage(p.storage, catalogTable.schema.map(toHiveColumn))
       tPartition.setSd(sd)
-      sd.setCols(catalogTable.schema.map(toHiveColumn).asJava)
-      p.storage.locationUri.foreach(sd.setLocation)
-      p.storage.inputFormat.foreach(sd.setInputFormat)
-      p.storage.outputFormat.foreach(sd.setOutputFormat)
+      new Partition(hiveQlTable, tPartition)
+    }
+  }
-      val serdeInfo = new org.apache.hadoop.hive.metastore.api.SerDeInfo
-      sd.setSerdeInfo(serdeInfo)
-      // maps and lists should be set only after all elements are ready (see HIVE-7975)
-      p.storage.serde.foreach(serdeInfo.setSerializationLib)
+  private def toHiveStorage(storage: CatalogStorageFormat, schema: Seq[FieldSchema]) = {

--- End diff --

This function has a bug, right? `storage` is not being used.
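The review point is that the new helper accepts a `storage` argument but never reads it, so the caller's partition-level storage settings are silently dropped in favor of whatever enclosing state the helper uses. A minimal Python analogue of the bug pattern and its fix (all names here are illustrative, not Spark's):

```python
# Illustrative analogue of the reviewed bug: a helper that takes a
# per-partition `storage` argument but ignores it.
class StorageFormat:
    def __init__(self, location=None):
        self.location = location

def to_descriptor_buggy(storage, table_default):
    # Bug: `storage` is never used; the table-level default always wins.
    return {"location": table_default.location}

def to_descriptor_fixed(storage, table_default):
    # Fix: prefer the partition-level storage when it is set.
    return {"location": storage.location or table_default.location}

partition_storage = StorageFormat(location="/data/part=1")
table_storage = StorageFormat(location="/data")

# The buggy version loses the partition location; the fixed one keeps it.
buggy = to_descriptor_buggy(partition_storage, table_storage)
fixed = to_descriptor_fixed(partition_storage, table_storage)
```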
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14028 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61943/ Test PASSed.
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14028 Merged build finished. Test PASSed.
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14028 **[Test build #61943 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61943/consoleFull)** for PR 14028 at commit [`6570a98`](https://github.com/apache/spark/commit/6570a9874e60ecb9366ea37a0e5dfe06b821dc62). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14102: [SPARK-16434][SQL][WIP] Avoid record-per type dispatch i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14102 **[Test build #61949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61949/consoleFull)** for PR 14102 at commit [`2d77f66`](https://github.com/apache/spark/commit/2d77f66f2c78bb139212011bfa1fa2efbf6b9d5b).
[GitHub] spark issue #13991: [SPARK-16318][SQL] Implement various xpath functions
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13991 Also - rather than having concrete implementations for all of these, why don't we use RuntimeReplaceable?
[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14082 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61946/ Test PASSed.
[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14082 Merged build finished. Test PASSed.
[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14082 **[Test build #61946 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61946/consoleFull)** for PR 14082 at commit [`1af09f3`](https://github.com/apache/spark/commit/1af09f31fd506143e8fe45b530dd46e39df76d6b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...
Github user husseinhazimeh commented on the issue: https://github.com/apache/spark/pull/14101 @mengxr @sethah can you review this patch?
[GitHub] spark issue #14102: [SPARK-16434][SQL][WIP] Avoid record-per type dispatch i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14102 **[Test build #61948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61948/consoleFull)** for PR 14102 at commit [`74fa944`](https://github.com/apache/spark/commit/74fa944209491b9884dbfc8b71e56e36b45e28a4).
[GitHub] spark pull request #14102: [SPARK-16434][SQL][WIP] Avoid record-per type dis...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14102 [SPARK-16434][SQL][WIP] Avoid record-per type dispatch in JSON when reading

## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)

## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-16434

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14102.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14102

commit 74fa944209491b9884dbfc8b71e56e36b45e28a4
Author: hyukjinkwon
Date: 2016-07-08T01:14:18Z

    Avoid record-per type dispatch in JSON when reading
[GitHub] spark issue #14101: [SPARK-16431] [ML] Add a unified method that accepts sin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14101 Can one of the admins verify this patch?
[GitHub] spark issue #14065: [SPARK-14743][YARN][WIP] Add a configurable token manage...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/14065 Thanks a lot @tgravescs and @vanzin for your suggestions, I will change the codes accordingly, greatly appreciate your comments.
[GitHub] spark pull request #14101: [SPARK-16431] [ML] Add a unified method that acce...
GitHub user husseinhazimeh opened a pull request: https://github.com/apache/spark/pull/14101 [SPARK-16431] [ML] Add a unified method that accepts single instances to feature transformers and predictors

## What changes were proposed in this pull request?
Current feature transformers in spark.ml can only operate on DataFrames and don't have a method that accepts single instances. A typical transformer has a user-defined function (udf) in its `transform` method that applies a set of operations to the features of a single instance:

```
val column_operation = udf {operations on single instance}
```

Adding a new method called `transformInstance` that operates directly on single instances, and using it in the udf, can be useful:

```
def transformInstance(features: featuresType): OutputType = {operations on single instance}
val column_operation = udf {transformInstance}
```

Predictors also don't have a public method that makes predictions on single instances. `transformInstance` can easily be added to predictors as a wrapper for the internal `predict` method (which takes features as input).

Note: the proposed method is added to all predictors and feature transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which might require bigger changes due to dependencies on the dataset's schema (they can be fixed using simple hacks, but this needs to be discussed).

## Benefits
1. Provides a low-latency transformation/prediction method to support machine learning applications that require real-time predictions. The current `transform` method has relatively high latency when transforming single instances or small batches due to the overhead introduced by DataFrame operations. I measured the latency required to classify a single instance in the 20 Newsgroups dataset using the current `transform` method and the proposed `transformInstance`. The ML pipeline contains a tokenizer, stopword remover, TF hasher, IDF, scaler, and Logistic Regression. The table below shows the latency percentiles in milliseconds after measuring the time to classify 700 documents.

Transformation Method | P50 | P90 | P99 | Max
--- | --- | --- | --- | ---
transform | 31.44 | 39.43 | 67.75 | 126.97
transformInstance | 0.16 | 0.38 | 1.16 | 3.2

`transformInstance` is 200 times faster on average and can classify a document in less than a millisecond. Profiling the code of `transform` shows that every transformer in the pipeline wastes 5 milliseconds on average in DataFrame-related operations when transforming a single instance. This implies that the latency increases linearly with the pipeline size, which can be problematic.

2. Increases code readability and allows easier debugging, as operations on rows are now combined into a function that can be tested independently of the higher-level `transform` method.

3. Adds flexibility to create new models: for example, check this [comment](https://github.com/apache/spark/pull/8883#issuecomment-215559305) on supporting new ensemble methods.

## How was this patch tested?
The current tests for transformers and predictors, which invoke `transformInstance` internally, passed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/husseinhazimeh/spark lowlatency

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14101.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14101

commit e8b3de1e599225fa71fecc17aaa34998863fb38b
Author: Hussein Hazimeh
Date: 2016-07-07T20:50:22Z

    Add transformInstance method to predictors and transformers

commit ca213e338bde7da2e308b2ffd9c3fa1b5d26122e
Author: Hussein Hazimeh
Date: 2016-07-07T21:03:46Z

    Update LogisticRegression.scala

commit 1fe5b18a0519d324ed53108ddd809a421a811f50
Author: Hussein Hazimeh
Date: 2016-07-07T21:21:45Z

    Update HashingTF.scala
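The proposed API shape — a per-instance method that the batch `transform` reuses per row, so the two paths cannot diverge — can be sketched in plain Python. All names here (`HashingTFLike`, `transform_instance`) are illustrative analogues, not the actual spark.ml signatures:

```python
# Sketch of the proposal: the batch path is built on the single-instance
# path, so a caller needing low latency can invoke transform_instance
# directly and skip DataFrame overhead.
class HashingTFLike:
    def __init__(self, num_features=16):
        self.num_features = num_features

    def transform_instance(self, tokens):
        """Operate on a single instance: term-frequency hashing of one row."""
        counts = [0] * self.num_features
        for tok in tokens:
            counts[hash(tok) % self.num_features] += 1
        return counts

    def transform(self, dataset):
        """Batch path: apply the single-instance method to every row."""
        return [self.transform_instance(row) for row in dataset]

tf = HashingTFLike(num_features=8)
single = tf.transform_instance(["spark", "ml"])
batch = tf.transform([["spark", "ml"], ["spark"]])
```

Because `transform` delegates to `transform_instance`, `batch[0]` is identical to `single`, which is the testability benefit the PR description mentions.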
[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14071#discussion_r70012893

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -403,17 +400,18 @@ object CreateDataSourceTableUtils extends Logging {
       assert(partitionColumns.isEmpty)
       assert(relation.partitionSchema.isEmpty)

+      var storage = CatalogStorageFormat(
+        locationUri = None,

--- End diff --

oh it's a mistake...
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14096 Thank you!
[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14094 Merged build finished. Test PASSed.
[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14094 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61936/ Test PASSed.
[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14094 **[Test build #61936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61936/consoleFull)** for PR 14094 at commit [`9663b42`](https://github.com/apache/spark/commit/9663b429c5da0e7967530cea82fb372221d9741f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70012365

--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
+# $example off:create_DataFrames$
+
+# $example on:untyped_transformations$
+# Create the DataFrame
+df <- read.json("examples/src/main/resources/people.json")
+
+# Show the content of the DataFrame
+showDF(df)
+## age   name
+## null  Michael
+## 30    Andy
+## 19    Justin
+
+# Print the schema in a tree format
+printSchema(df)
+## root
+## |-- age: long (nullable = true)
+## |-- name: string (nullable = true)
+
+# Select only the "name" column
+showDF(select(df, "name"))
+## name
+## Michael
+## Andy
+## Justin
+
+# Select everybody, but increment the age by 1
+showDF(select(df, df$name, df$age + 1))
+## name     (age + 1)
+## Michael  null
+## Andy     31
+## Justin   20
+
+# Select people older than 21
+showDF(where(df, df$age > 21))
+## age  name
+## 30   Andy
+
+# Count people by age
+showDF(count(groupBy(df, "age")))
+## age   count
+## null  1
+## 19    1
+## 30    1
+# $example off:untyped_transformations$
+
+# $example on:sql_query$
+df <- sql("SELECT * FROM table")

--- End diff --

Let's register `df` from above using `createExternalTable` and then run the query. We should aim for a case where this R file is executable on its own.
[GitHub] spark issue #13991: [SPARK-16318][SQL] Implement various xpath functions
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13991 For this one I think we should consider supporting only foldable literals for the path component.
[GitHub] spark issue #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain when no...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14100 **[Test build #61945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61945/consoleFull)** for PR 14100 at commit [`e00bc53`](https://github.com/apache/spark/commit/e00bc53cc9835104045a5c451a058c77d84f382c).
[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14082 **[Test build #61946 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61946/consoleFull)** for PR 14082 at commit [`1af09f3`](https://github.com/apache/spark/commit/1af09f31fd506143e8fe45b530dd46e39df76d6b).
[GitHub] spark issue #13991: [SPARK-16318][SQL] Implement various xpath functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13991 **[Test build #61947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61947/consoleFull)** for PR 13991 at commit [`2d48ae5`](https://github.com/apache/spark/commit/2d48ae553736928cb4003a530608513d6f3307ab).
[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14096
[GitHub] spark pull request #14100: [SPARK-16433][SQL]Improve StreamingQuery.explain ...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/14100 [SPARK-16433][SQL] Improve StreamingQuery.explain when no data arrives ## What changes were proposed in this pull request? Display `No physical plan. Waiting for data.` instead of `N/A` for StreamingQuery.explain when no data arrives, because `N/A` doesn't provide meaningful information. ## How was this patch tested? Existing unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-16433 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14100.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14100 commit e00bc53cc9835104045a5c451a058c77d84f382c Author: Shixiong Zhu Date: 2016-07-08T00:45:24Z Improve StreamingQuery.explain when no data arrives
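The behavior change described above can be sketched as a simple fallback: when the query has not yet produced a physical plan, return the informative message rather than a placeholder. This is a hypothetical illustration, not Spark's actual `StreamingQuery` API; the method and parameter names are made up for clarity.

```java
// Illustrative sketch of the PR's behavior change (names are hypothetical):
// if no batch has run yet, there is no physical plan to show, so explain()
// returns a helpful message instead of "N/A".
public class ExplainFallback {
    static String explainInternal(Object lastExecution) {
        if (lastExecution == null) {
            // No data has arrived yet, so no plan exists.
            return "No physical plan. Waiting for data.";
        }
        return lastExecution.toString();
    }
}
```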
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14096 LGTM. Thanks @dongjoon-hyun -- Merging this to master, branch-2.0
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14096 Hi, @shivaram. It's now ready for review again. Please let me know if there is anything more to do. Thank you!
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61942/ Test PASSed.
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70011796 --- Diff: examples/src/main/r/RSparkSQLExample.R --- @@ -0,0 +1,175 @@ +# $example on:init_session$ +sparkR.session() +# $example off:init_session$ + + +# $example on:create_DataFrames$ +df <- read.json("examples/src/main/resources/people.json") + +# Displays the content of the DataFrame +showDF(df) +# $example off:create_DataFrames$ + + +# $example on:untyped_transformations$ +# Create the DataFrame +df <- read.json("examples/src/main/resources/people.json") + +# Show the content of the DataFrame +showDF(df) +## age name +## null Michael +## 30 Andy +## 19 Justin + +# Print the schema in a tree format +printSchema(df) +## root +## |-- age: long (nullable = true) +## |-- name: string (nullable = true) + +# Select only the "name" column +showDF(select(df, "name")) +## name +## Michael +## Andy +## Justin + +# Select everybody, but increment the age by 1 +showDF(select(df, df$name, df$age + 1)) +## name (age + 1) +## Michael null +## Andy 31 +## Justin 20 + +# Select people older than 21 +showDF(where(df, df$age > 21)) +## age name +## 30 Andy + +# Count people by age +showDF(count(groupBy(df, "age"))) +## age count +## null 1 +## 19 1 +## 30 1 +# $example off:untyped_transformations$ + + +# $example on:sql_query$ +df <- sql("SELECT * FROM table") --- End diff -- Should I add more here to create the `table`, or just leave it since it's only for demonstration purposes?
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096 Merged build finished. Test PASSed.
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14096 **[Test build #61942 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61942/consoleFull)** for PR 14096 at commit [`c332c52`](https://github.com/apache/spark/commit/c332c52ba7fb9e23a372a74cd2ac6ea8b3704b5d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13969: [SPARK-16284][SQL] Implement reflect SQL function
Github user petermaxlee commented on the issue: https://github.com/apache/spark/pull/13969 Ping!
[GitHub] spark issue #13374: [SPARK-13638][SQL] Add escapeAll option to CSV DataFrame...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13374 @jurriaan should this be called `quoteAll` rather than `escapeAll`?
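The naming question above turns on what the option actually does. As a rough, hand-rolled illustration (this is not Spark's or univocity's CSV writer), "quote all" wraps every field in quotes, whereas escaping only rewrites embedded quote characters and quotes the fields that need it:

```java
// Hand-rolled sketch of CSV field quoting (illustrative only):
// - quoteAll = true: every field is wrapped in double quotes.
// - quoteAll = false: a field is quoted only when it contains a delimiter,
//   a quote, or a newline; embedded quotes are escaped by doubling them.
public class CsvQuoting {
    static String writeField(String value, boolean quoteAll) {
        String escaped = value.replace("\"", "\"\"");
        boolean needsQuotes =
            quoteAll || value.contains(",") || value.contains("\"") || value.contains("\n");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }
}
```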
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13680 LGTM except some minor comments, it's pretty close! One easy-to-ignore comment: https://github.com/apache/spark/pull/13680/files#r69849567
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70011133 --- Diff: examples/src/main/r/RSparkSQLExample.R --- @@ -0,0 +1,175 @@ +# $example on:init_session$ +sparkR.session() +# $example off:init_session$ + + +# $example on:create_DataFrames$ +df <- read.json("examples/src/main/resources/people.json") + +# Displays the content of the DataFrame +showDF(df) --- End diff -- @shivaram your idea is better, vote +1
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70011061 --- Diff: examples/src/main/r/RSparkSQLExample.R --- @@ -0,0 +1,175 @@ +# $example on:init_session$ +sparkR.session() +# $example off:init_session$ + + +# $example on:create_DataFrames$ +df <- read.json("examples/src/main/resources/people.json") + +# Displays the content of the DataFrame +showDF(df) --- End diff -- Sorry, just ignore the above one. I rebuilt with `build/mvn -DskipTests -Psparkr package` and everything works...
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r70010656 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- @@ -0,0 +1,215 @@ +package org.apache.spark.sql.execution.benchmark + +import scala.util.Random + +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder +import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter} +import org.apache.spark.util.Benchmark + +/** + * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData + * To run this: + * 1. replace ignore(...) with test(...) + * 2. build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark" + * + * Benchmarks in this file are skipped in normal builds.
+ */ +class UnsafeArrayDataBenchmark extends BenchmarkBase { + + def calculateHeaderPortionInBytes(count: Int) : Int = { +/* 4 + 4 * count // Use this expression for SPARK-15962 */ +UnsafeArrayData.calculateHeaderPortionInBytes(count) + } + + def readUnsafeArray(iters: Int): Unit = { +val count = 1024 * 1024 * 16 +val rand = new Random(42) + +val intPrimitiveArray = Array.fill[Int](count) { rand.nextInt } +val intEncoder = ExpressionEncoder[Array[Int]].resolveAndBind() +val intUnsafeArray = intEncoder.toRow(intPrimitiveArray).getArray(0) +val readIntArray = { i: Int => + var n = 0 + while (n < iters) { +val len = intUnsafeArray.numElements +var sum = 0 +var i = 0 +while (i < len) { + sum += intUnsafeArray.getInt(i) + i += 1 +} +n += 1 + } +} + +val doublePrimitiveArray = Array.fill[Double](count) { rand.nextDouble } +val doubleEncoder = ExpressionEncoder[Array[Double]].resolveAndBind() +val doubleUnsafeArray = doubleEncoder.toRow(doublePrimitiveArray).getArray(0) +val readDoubleArray = { i: Int => + var n = 0 + while (n < iters) { +val len = doubleUnsafeArray.numElements +var sum = 0.0 +var i = 0 +while (i < len) { + sum += doubleUnsafeArray.getDouble(i) + i += 1 +} +n += 1 + } +} + +val benchmark = new Benchmark("Read UnsafeArrayData", count * iters) +benchmark.addCase("Int")(readIntArray) +benchmark.addCase("Double")(readDoubleArray) +benchmark.run +/* +Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4 +Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz + +Read UnsafeArrayData: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +Int 279 / 294 600.4 1.7 1.0X +Double 296 / 303 567.0 1.8 0.9X +*/ + } + + def writeUnsafeArray(iters: Int): Unit = { +val count = 1024 * 1024 * 16 +val rand = new Random(42) + +var intTotalLength: Int = 0 +val intPrimitiveArray = Array.fill[Int](count) { rand.nextInt } +val intEncoder = ExpressionEncoder[Array[Int]].resolveAndBind() +val writeIntArray = { i: Int => + var n = 0 + while (n < iters) { +intTotalLength += intEncoder.toRow(intPrimitiveArray).getArray(0).numElements() +n += 1 + } +} + +var doubleTotalLength: Int = 0 +val doublePrimitiveArray = Array.fill[Double](count) { rand.nextDouble } +val doubleEncoder =
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r70010528 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java --- @@ -341,63 +327,115 @@ public UnsafeArrayData copy() { return arrayCopy; } - public static UnsafeArrayData fromPrimitiveArray(int[] arr) { -if (arr.length > (Integer.MAX_VALUE - 4) / 8) { - throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " + -"it's too big."); -} + @Override + public boolean[] toBooleanArray() { +int size = numElements(); +boolean[] values = new boolean[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.BOOLEAN_ARRAY_OFFSET, size); +return values; + } -final int offsetRegionSize = 4 * arr.length; -final int valueRegionSize = 4 * arr.length; -final int totalSize = 4 + offsetRegionSize + valueRegionSize; -final byte[] data = new byte[totalSize]; + @Override + public byte[] toByteArray() { +int size = numElements(); +byte[] values = new byte[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size); +return values; + } -Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length); + @Override + public short[] toShortArray() { +int size = numElements(); +short[] values = new short[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.SHORT_ARRAY_OFFSET, size * 2); +return values; + } -int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4; -int valueOffset = 4 + offsetRegionSize; -for (int i = 0; i < arr.length; i++) { - Platform.putInt(data, offsetPosition, valueOffset); - offsetPosition += 4; - valueOffset += 4; -} + @Override + public int[] toIntArray() { +int size = numElements(); +int[] values = new int[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.INT_ARRAY_OFFSET, size * 4); +return values; + } 
-Platform.copyMemory(arr, Platform.INT_ARRAY_OFFSET, data, - Platform.BYTE_ARRAY_OFFSET + 4 + offsetRegionSize, valueRegionSize); + @Override + public long[] toLongArray() { +int size = numElements(); +long[] values = new long[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.LONG_ARRAY_OFFSET, size * 8); +return values; + } -UnsafeArrayData result = new UnsafeArrayData(); -result.pointTo(data, Platform.BYTE_ARRAY_OFFSET, totalSize); -return result; + @Override + public float[] toFloatArray() { +int size = numElements(); +float[] values = new float[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.FLOAT_ARRAY_OFFSET, size * 4); +return values; } - public static UnsafeArrayData fromPrimitiveArray(double[] arr) { -if (arr.length > (Integer.MAX_VALUE - 4) / 12) { + @Override + public double[] toDoubleArray() { +int size = numElements(); +double[] values = new double[size]; +Platform.copyMemory( + baseObject, baseOffset + headerInBytes, values, Platform.DOUBLE_ARRAY_OFFSET, size * 8); +return values; + } + + private static UnsafeArrayData fromPrimitiveArray( + Object arr, int offset, int length, int elementSize) { +final int headerInBytes = calculateHeaderPortionInBytes(length); +final int valueRegionInBytes = elementSize * length; +final int totalSizeInLongs = (headerInBytes + valueRegionInBytes + 7) / 8; --- End diff -- Sorry, I was wrong :( We need to declare them all as `long` in order to do the overflow check.
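The overflow concern raised above can be sketched in isolation: if the header and value-region sizes are computed in `int`, a very long array silently wraps past `Integer.MAX_VALUE`; computing them in `long` lets the code detect and reject oversized arrays. The method names below mimic the diff, but the header layout (an 8-byte length word plus one 8-byte word per 64 elements) is an assumption for illustration, not Spark's verified layout.

```java
// Overflow-safe size computation sketch: all intermediates are long, so an
// array too big for the unsafe format raises an exception instead of
// producing a wrapped-around (possibly negative) int size.
public class ArraySizing {
    // Assumed header layout for illustration: 8-byte length word plus one
    // 8-byte null-tracking word per 64 elements, rounded up.
    static long headerInBytes(int numElements) {
        return 8L + (((long) numElements + 63) / 64) * 8L;
    }

    static long totalSizeInLongs(int numElements, int elementSize) {
        long valueRegionInBytes = (long) elementSize * numElements;
        long totalSizeInLongs = (headerInBytes(numElements) + valueRegionInBytes + 7) / 8;
        if (totalSizeInLongs > Integer.MAX_VALUE / 8) {
            throw new UnsupportedOperationException(
                "Cannot convert this array to unsafe format as it's too big.");
        }
        return totalSizeInLongs;
    }
}
```

Had the same expression been written with `int` intermediates, `elementSize * length` alone would overflow for, say, two billion 8-byte elements, and the guard would never fire.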
[GitHub] spark issue #14099: [SPARK-16432] Empty blocks fail to serialize due to asse...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14099 **[Test build #61944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61944/consoleFull)** for PR 14099 at commit [`9ce8146`](https://github.com/apache/spark/commit/9ce81468b41c7515c6ce3fd4791e0ecc03a7de19).
[GitHub] spark pull request #14099: [SPARK-16432] Empty blocks fail to serialize due ...
GitHub user ericl opened a pull request: https://github.com/apache/spark/pull/14099 [SPARK-16432] Empty blocks fail to serialize due to assert in ChunkedByteBuffer ## What changes were proposed in this pull request? It's possible to also change the callers to not pass in empty chunks, but it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc @JoshRosen ## How was this patch tested? Unit tests, also checked that the original reproduction case in https://github.com/apache/spark/pull/11748#issuecomment-230760283 is resolved. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ericl/spark spark-16432 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14099.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14099 commit 9ce81468b41c7515c6ce3fd4791e0ecc03a7de19 Author: Eric Liang Date: 2016-07-08T00:18:29Z Thu Jul 7 17:18:29 PDT 2016
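The design choice described above — tolerating empty chunks in the buffer abstraction rather than policing every caller — can be sketched as follows. This is an illustrative stand-in, not Spark's actual `ChunkedByteBuffer`; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: a chunked buffer that accepts zero-length chunks
// (the behavior the PR enables) instead of asserting they are non-empty.
// Only null chunks are rejected.
public class ChunkedBuffer {
    private final ByteBuffer[] chunks;

    public ChunkedBuffer(ByteBuffer[] chunks) {
        for (ByteBuffer c : chunks) {
            if (c == null) {
                throw new IllegalArgumentException("null chunk");
            }
            // Empty (zero-remaining) chunks are deliberately allowed here.
        }
        this.chunks = chunks;
    }

    public long size() {
        long total = 0;
        for (ByteBuffer c : chunks) {
            total += c.remaining();
        }
        return total;
    }

    public byte[] toArray() {
        byte[] out = new byte[(int) size()];
        int pos = 0;
        for (ByteBuffer c : chunks) {
            int n = c.remaining();
            // duplicate() so the chunk's own position is left untouched.
            c.duplicate().get(out, pos, n);
            pos += n;
        }
        return out;
    }
}
```

With this shape, callers that legitimately produce empty blocks (e.g. an empty partition being serialized) need no special casing.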
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14028 **[Test build #61943 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61943/consoleFull)** for PR 14028 at commit [`6570a98`](https://github.com/apache/spark/commit/6570a9874e60ecb9366ea37a0e5dfe06b821dc62).
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14096 **[Test build #61942 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61942/consoleFull)** for PR 14096 at commit [`c332c52`](https://github.com/apache/spark/commit/c332c52ba7fb9e23a372a74cd2ac6ea8b3704b5d).
[GitHub] spark issue #14028: [SPARK-16351][SQL] Avoid per-record type dispatch in JSO...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14028 Thanks @yhuai! I just addressed your comments.
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70009456 --- Diff: examples/src/main/r/RSparkSQLExample.R --- @@ -0,0 +1,175 @@ +# $example on:init_session$ +sparkR.session() +# $example off:init_session$ + + +# $example on:create_DataFrames$ +df <- read.json("examples/src/main/resources/people.json") + +# Displays the content of the DataFrame +showDF(df) --- End diff -- Do the unit tests pass? We have a unit test for `showDF`.
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14098

Merged build finished. Test PASSed.
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096

Merged build finished. Test FAILed.
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61941/
Test FAILed.
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14098

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61940/
Test PASSed.
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14096

**[Test build #61941 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61941/consoleFull)** for PR 14096 at commit [`08672d9`](https://github.com/apache/spark/commit/08672d98c682e1bbe78399c4b0814bfa28d45826).
* This patch **fails R style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14098

**[Test build #61940 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61940/consoleFull)** for PR 14098 at commit [`d92d933`](https://github.com/apache/spark/commit/d92d933937571b65670a2a308c936c0d9061b382).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70008864

--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

I just ran `showDF()` and it seems this method is not working, while `head()` works fine. Is it a problem with how I built SparkR, via `build/mvn -DskipTests -Psparkr -Phive package`?

```
> df <- read.json("examples/src/main/resources/people.json")
> showDF(df)
16/07/07 17:02:54 WARN RBackendHandler: cannot find matching method class org.apache.spark.sql.Dataset.showString. Candidates are:
16/07/07 17:02:54 WARN RBackendHandler: showString(int,int)
16/07/07 17:02:54 ERROR RBackendHandler: showString on 7 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
> showDF(select(df, "name"))
16/07/07 17:03:38 WARN RBackendHandler: cannot find matching method class org.apache.spark.sql.Dataset.showString. Candidates are:
16/07/07 17:03:38 WARN RBackendHandler: showString(int,int)
16/07/07 17:03:38 ERROR RBackendHandler: showString on 18 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
```
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14096

**[Test build #61941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61941/consoleFull)** for PR 14096 at commit [`08672d9`](https://github.com/apache/spark/commit/08672d98c682e1bbe78399c4b0814bfa28d45826).
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70008649

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes.  It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, eg.: bad user code, which may lead to many
+ * task failures, but that should not count against individual executors; many small stages, which
+ * may prevent a bad executor for having many failures within one stage, but still many failures
+ * over the entire application; "flaky" executors, that don't fail every task, but are still
+ * faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe.  Though it is
+ * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl.  The
+ * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
--- End diff --

I generally ask people to add new configs to `core/src/main/scala/org/apache/spark/internal/config/package.scala`, but no big deal either way.
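For readers unfamiliar with the pattern vanzin is suggesting, entries in `config/package.scala` look roughly like the sketch below. This is a hedged illustration, not the code from the PR: it reuses the `spark.blacklist.maxTaskFailuresPerNode` key from the quoted diff, and the `ConfigBuilder` chain shown is the Spark 2.x internal-config API, whose exact methods may differ between versions.

```scala
// Sketch of the centralized-config style used in
// core/src/main/scala/org/apache/spark/internal/config/package.scala.
// The key name comes from the diff above; the builder calls follow the
// internal ConfigBuilder API as of Spark 2.x (an assumption, not taken
// from this PR).
package org.apache.spark.internal

package object config {
  private[spark] val MAX_TASK_FAILURES_PER_NODE =
    ConfigBuilder("spark.blacklist.maxTaskFailuresPerNode")
      .doc("Maximum task failures on one node before the node is blacklisted.")
      .intConf
      .createWithDefault(2)
}
```

Call sites can then read the value with `conf.get(MAX_TASK_FAILURES_PER_NODE)` instead of scattering string keys and default values through the scheduler code.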
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70008186

--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala ---
@@ -281,6 +300,174 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B
     assert(!failedTaskSet)
   }
+
+  test("scheduled tasks obey task and stage blacklists") {
+    val blacklist = mock[BlacklistTracker]
+    taskScheduler = setupScheduler(blacklist)
+    val stage0 = FakeTask.createTaskSet(numTasks = 2, stageId = 0, stageAttemptId = 0)
+    val stage1 = FakeTask.createTaskSet(numTasks = 2, stageId = 1, stageAttemptId = 0)
+    val stage2 = FakeTask.createTaskSet(numTasks = 2, stageId = 2, stageAttemptId = 0)
+    taskScheduler.submitTasks(stage0)
+    taskScheduler.submitTasks(stage1)
+    taskScheduler.submitTasks(stage2)
+
+    val offers = Seq(
+      new WorkerOffer("executor0", "host0", 1),
+      new WorkerOffer("executor1", "host1", 1),
+      new WorkerOffer("executor2", "host1", 1),
+      new WorkerOffer("executor3", "host2", 10)
+    )
+
+    // setup our mock blacklist:
+    // stage 0 is blacklisted on node "host1"
+    // stage 1 is blacklist on executor "executor3"
--- End diff --

nit: blacklisted
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61938/
Test PASSed.
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096

Merged build finished. Test PASSed.
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70008001

--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

I'd use `head` as the default for most examples. It feels most natural. We can then add one line to the programming guide that reads something like "You can also use `showDF` to print the first few rows and optionally truncate the printing of long values".
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14096

**[Test build #61938 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61938/consoleFull)** for PR 14096 at commit [`65f2236`](https://github.com/apache/spark/commit/65f2236970d42ab1ee8115a5f0102119504e633f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70007837

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -682,11 +682,12 @@ private[spark] class ApplicationMaster(
   }

   override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
-    case RequestExecutors(requestedTotal, localityAwareTasks, hostToLocalTaskCount) =>
+    case RequestExecutors(
--- End diff --

Wonder if the argument list is really helping here. E.g.:

```
case r: RequestExecutors =>
  a.requestTotalExecutorsWithPreferredLocalities(r.requestedTotal, ...)
```
[GitHub] spark pull request #14082: [SPARK-16381][SQL][SparkR] Update SQL examples an...
Github user keypointt commented on a diff in the pull request: https://github.com/apache/spark/pull/14082#discussion_r70007444

--- Diff: examples/src/main/r/RSparkSQLExample.R ---
@@ -0,0 +1,175 @@
+# $example on:init_session$
+sparkR.session()
+# $example off:init_session$
+
+
+# $example on:create_DataFrames$
+df <- read.json("examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame
+showDF(df)
--- End diff --

Do you mean all `showDF()` calls should be replaced by `head()`? E.g., change `showDF(select(df, "name"))` to `head(select(df, "name"))` too? Or should we leave both `showDF()` and `head()` as examples for the reader?
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70007341

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+    conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor.  Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage.  Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+    new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+    new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+    val now = clock.getTimeMillis()
+    // quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+    if (now > nextExpiryTime) {
+      val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+      if (execsToClear.nonEmpty) {
+        logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+        execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+      }
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14098

**[Test build #61940 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61940/consoleFull)** for PR 14098 at commit [`d92d933`](https://github.com/apache/spark/commit/d92d933937571b65670a2a308c936c0d9061b382).
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70007113

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70006954

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, e.g.: bad user code, which may lead to
+ * many task failures, but that should not count against individual executors; many small stages,
+ * which may prevent a bad executor from having many failures within one stage, but still many
+ * failures over the entire application; "flaky" executors, that don't fail every task, but are
+ * still faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
+ * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
+ * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+    conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor. Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage. Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+    new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+    new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+    val now = clock.getTimeMillis()
+    // quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+    if (now > nextExpiryTime) {
+      val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+      if (execsToClear.nonEmpty) {
+        logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+        execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+      }
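The expiry logic in the quoted `expireExecutorsInBlacklist()` follows a common lazy-expiry pattern: cache the earliest expiry timestamp (`nextExpiryTime`) so the common case is a single comparison instead of a scan of the whole map. A minimal standalone sketch of that pattern, assuming a hypothetical `ExpiryTracker` class (not the Spark code; the `clock` function stands in for Spark's injectable `Clock`):

```scala
import scala.collection.mutable.HashMap

// Sketch of lazy blacklist expiry: track the earliest expiry time so
// the hot path is one comparison, not a full map scan.
class ExpiryTracker(clock: () => Long) {
  private val expiryByExecutor = new HashMap[String, Long]()
  private var nextExpiryTime: Long = Long.MaxValue

  def blacklist(exec: String, durationMs: Long): Unit = {
    val expiry = clock() + durationMs
    expiryByExecutor(exec) = expiry
    nextExpiryTime = math.min(nextExpiryTime, expiry)
  }

  def expire(): Unit = {
    val now = clock()
    // quick check: skip all work unless something can actually have expired
    if (now > nextExpiryTime) {
      val toClear = expiryByExecutor.filter(_._2 < now).keys.toSeq
      toClear.foreach(expiryByExecutor.remove)
      // recompute the cached minimum for the next cheap check
      nextExpiryTime =
        if (expiryByExecutor.isEmpty) Long.MaxValue else expiryByExecutor.values.min
    }
  }

  def isBlacklisted(exec: String): Boolean = expiryByExecutor.contains(exec)
}
```

With an injected fake clock this is also easy to unit-test, which is presumably why the real class takes a `Clock` parameter.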
[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14096#discussion_r70006844 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1804,11 +1804,11 @@ test_that("describe() and summarize() on a DataFrame", { expect_equal(collect(stats)[2, "age"], "24.5") expect_equal(collect(stats)[3, "age"], "7.7781745930520225") stats <- describe(df) - expect_equal(collect(stats)[4, "name"], "Andy") + expect_equal(columns(stats), c("summary", "age")) --- End diff -- Sure. I agree. That would be better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14096#discussion_r70006591 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1804,11 +1804,11 @@ test_that("describe() and summarize() on a DataFrame", { expect_equal(collect(stats)[2, "age"], "24.5") expect_equal(collect(stats)[3, "age"], "7.7781745930520225") stats <- describe(df) - expect_equal(collect(stats)[4, "name"], "Andy") + expect_equal(columns(stats), c("summary", "age")) --- End diff -- Sorry one last thing - instead of removing the previous test can we add a new assert? Also maybe can we add the failing test case from the JIRA as a new test case?
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61935/ Test PASSed.
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14096 Merged build finished. Test PASSed.
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14096 **[Test build #61935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61935/consoleFull)** for PR 14096 at commit [`97a158e`](https://github.com/apache/spark/commit/97a158e9e91c800b2be7682d6cc77b86d047626c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14097: [MINOR][Streaming][Docs] Minor changes on kinesis integr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14097 **[Test build #61937 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61937/consoleFull)** for PR 14097 at commit [`0740b2b`](https://github.com/apache/spark/commit/0740b2b053d854f394aa2094a29dd878e4d4bd5d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14097: [MINOR][Streaming][Docs] Minor changes on kinesis integr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14097 Merged build finished. Test PASSed.
[GitHub] spark issue #14097: [MINOR][Streaming][Docs] Minor changes on kinesis integr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14097 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61937/ Test PASSed.
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70005830

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -236,29 +246,43 @@ private[spark] class TaskSchedulerImpl(
    * given TaskSetManager have completed, so state associated with the TaskSetManager should be
    * cleaned up.
    */
-  def taskSetFinished(manager: TaskSetManager): Unit = synchronized {
+  def taskSetFinished(manager: TaskSetManager, success: Boolean): Unit = synchronized {
     taskSetsByStageIdAndAttempt.get(manager.taskSet.stageId).foreach { taskSetsForStage =>
       taskSetsForStage -= manager.taskSet.stageAttemptId
       if (taskSetsForStage.isEmpty) {
         taskSetsByStageIdAndAttempt -= manager.taskSet.stageId
       }
     }
     manager.parent.removeSchedulable(manager)
-    logInfo("Removed TaskSet %s, whose tasks have all completed, from pool %s"
-      .format(manager.taskSet.id, manager.parent.name))
+    if (success) {
+      blacklistTracker.map(_.taskSetSucceeded(manager.taskSet.stageId, this))

--- End diff --

`.foreach`. This happens in other places too, I'll refrain from pointing out all of them.
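The `.foreach` nit above is a standard Scala style point: `Option.map` is for transforming the wrapped value, while `Option.foreach` is for running a side effect when the option is defined. A small illustrative snippet (hypothetical names, not the Spark code):

```scala
// Hypothetical stand-in for an Option-wrapped component like blacklistTracker.
val tracker: Option[String] = Some("blacklistTracker")

var log: List[String] = Nil

// Discouraged for side effects: map suggests a transformation, and the
// resulting Option[Unit] is built only to be thrown away.
val discarded: Option[Unit] = tracker.map(t => log ::= s"started $t")

// Preferred: foreach states the intent -- run the effect if defined,
// produce no result.
tracker.foreach(t => log ::= s"stopped $t")
```

Both lines run the effect here; the difference is intent and the useless `Option[Unit]` result, which is why reviewers flag `.map` used purely for effects.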
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14098 Merged build finished. Test FAILed.
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14098 **[Test build #61939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61939/consoleFull)** for PR 14098 at commit [`94df090`](https://github.com/apache/spark/commit/94df0908d17d28eb6c0223456df7daad24899e47). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14098 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61939/ Test FAILed.
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70005741

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -52,13 +51,23 @@ import org.apache.spark.util.{AccumulatorV2, ThreadUtils, Utils}
  * acquire a lock on us, so we need to make sure that we don't try to lock the backend while
  * we are holding a lock on ourselves.
  */
-private[spark] class TaskSchedulerImpl(
+private[spark] class TaskSchedulerImpl private[scheduler](
     val sc: SparkContext,
     val maxTaskFailures: Int,
+    private[scheduler] val blacklistTracker: Option[BlacklistTracker],
+    private val clock: Clock = new SystemClock,
     isLocal: Boolean = false)
   extends TaskScheduler with Logging {

-  def this(sc: SparkContext) = this(sc, sc.conf.getInt("spark.task.maxFailures", 4))
+  def this(sc: SparkContext) = {
+    this(sc, sc.conf.getInt("spark.task.maxFailures", 4),
+      TaskSchedulerImpl.createBlacklistTracker(sc.conf))
+  }
+
+  def this(sc: SparkContext, maxTaskFailures: Int, isLocal: Boolean) = {
+    this(sc, maxTaskFailures, TaskSchedulerImpl.createBlacklistTracker(sc.conf),
+      clock = new SystemClock, isLocal)

--- End diff --

nit: also use arg name for `isLocal`.
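The "use arg name" nit refers to Scala named arguments: when a call site passes a bare literal for a defaulted or Boolean parameter, naming it documents which parameter is meant. A toy sketch, assuming a hypothetical `Scheduler` class shaped loosely like the constructor in the diff:

```scala
// Hypothetical class with trailing defaulted parameters, loosely mirroring
// the TaskSchedulerImpl constructor shape from the quoted diff.
class Scheduler(
    val maxTaskFailures: Int,
    val useSystemClock: Boolean = true,
    val isLocal: Boolean = false)

// Positional Booleans are easy to misread at the call site:
val ambiguous = new Scheduler(4, true, true)

// Named arguments make each flag explicit and let unrelated defaults stand:
val explicit = new Scheduler(maxTaskFailures = 4, isLocal = true)
```

Named arguments also let a caller skip over defaults (here `useSystemClock`) without supplying them positionally, which is exactly the situation in the auxiliary constructor under review.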
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70005706

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -153,6 +162,7 @@ private[spark] class TaskSchedulerImpl(
   override def start() {
     backend.start()
+    blacklistTracker.map(_.start())

--- End diff --

`.foreach`
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14098 **[Test build #61939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61939/consoleFull)** for PR 14098 at commit [`94df090`](https://github.com/apache/spark/commit/94df0908d17d28eb6c0223456df7daad24899e47).
[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/14088 @tgravescs fixed the description and style
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/14098 @liancheng Can you review it? Thanks!
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70005525 --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/14098 [SPARK-16380][SQL][Example] Update SQL examples and programming guide for Python language binding ## What changes were proposed in this pull request? In the current sql-programming-guide.md, the Python examples are hard-coded in the md file. This PR adds a separate SparkSQLExample.py, following the approach of the ML examples, and collects all the working hard-coded examples into one self-contained application. The exceptions are the Hive examples and calls that don't exist in SparkSession (for example, spark.refershtable); we can revisit those examples later and fold them into the self-contained application. ## How was this patch tested? Manual tests: ./bin/spark-submit examples/src/main/python/SparkSQLExample.py Also built the docs and checked that the generated document includes the examples correctly, as the ML document does. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark sql Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14098.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14098
[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should not fail with non-n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14096 **[Test build #61938 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61938/consoleFull)** for PR 14096 at commit [`65f2236`](https://github.com/apache/spark/commit/65f2236970d42ab1ee8115a5f0102119504e633f).
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r70005272 --- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70005147

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, e.g.: bad user code, which may lead to
+ * many task failures, but that should not count against individual executors; many small stages,
+ * which may prevent a bad executor from having many failures within one stage, but still many
+ * failures over the entire application; "flaky" executors, that don't fail every task, but are
+ * still faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
+ * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
+ * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+    conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor. Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage. Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+    new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+    new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+    val now = clock.getTimeMillis()
+    // quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+    if (now > nextExpiryTime) {
+      val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+      if (execsToClear.nonEmpty) {
+        logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+        execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+      }
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r70004452

--- Diff: core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.atomic.AtomicReference
+
+import scala.collection.mutable.{HashMap, HashSet}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+import org.apache.spark.util.SystemClock
+import org.apache.spark.util.Utils
+
+/**
+ * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
+ * specific (executor, task) pairs within a stage, blacklisting entire executors and nodes for a
+ * stage, and blacklisting executors and nodes across an entire application (with a periodic
+ * expiry).
+ *
+ * The tracker needs to deal with a variety of workloads, e.g.: bad user code, which may lead to
+ * many task failures, but that should not count against individual executors; many small stages,
+ * which may prevent a bad executor from having many failures within one stage, but still many
+ * failures over the entire application; "flaky" executors, that don't fail every task, but are
+ * still faulty; etc.
+ *
+ * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
+ * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
+ * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
+ */
+private[scheduler] class BlacklistTracker (
+    conf: SparkConf,
+    clock: Clock = new SystemClock()) extends Logging {
+
+  private val MAX_TASK_FAILURES_PER_NODE =
+    conf.getInt("spark.blacklist.maxTaskFailuresPerNode", 2)
+  private val MAX_FAILURES_PER_EXEC =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutor", 2)
+  private val MAX_FAILURES_PER_EXEC_STAGE =
+    conf.getInt("spark.blacklist.maxFailedTasksPerExecutorStage", 2)
+  private val MAX_FAILED_EXEC_PER_NODE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNode", 2)
+  private val MAX_FAILED_EXEC_PER_NODE_STAGE =
+    conf.getInt("spark.blacklist.maxFailedExecutorsPerNodeStage", 2)
+  val EXECUTOR_RECOVERY_MILLIS = BlacklistTracker.getBlacklistExpiryTime(conf)
+
+  // a count of failed tasks for each executor. Only counts failures after tasksets complete
+  // successfully
+  private val executorIdToFailureCount: HashMap[String, Int] = HashMap()
+  // failures for each executor by stage. Only tracked while the stage is running.
+  val stageIdToExecToFailures: HashMap[Int, HashMap[String, FailureStatus]] =
+    new HashMap()
+  val stageIdToNodeBlacklistedTasks: HashMap[Int, HashMap[String, HashSet[Int]]] =
+    new HashMap()
+  val stageIdToBlacklistedNodes: HashMap[Int, HashSet[String]] = new HashMap()
+  private val executorIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val nodeIdToBlacklistExpiryTime: HashMap[String, Long] = new HashMap()
+  private val _nodeBlacklist: AtomicReference[Set[String]] = new AtomicReference(Set())
+  private var nextExpiryTime: Long = Long.MaxValue
+
+  def start(): Unit = {}
+
+  def stop(): Unit = {}
+
+  def expireExecutorsInBlacklist(): Unit = {
+    val now = clock.getTimeMillis()
+    // quickly check if we've got anything to expire from blacklist -- if not, avoid doing any work
+    if (now > nextExpiryTime) {
+      val execsToClear = executorIdToBlacklistExpiryTime.filter(_._2 < now).keys
+      if (execsToClear.nonEmpty) {
+        logInfo(s"Removing executors $execsToClear from blacklist during periodic recovery")
+        execsToClear.foreach { exec => executorIdToBlacklistExpiryTime.remove(exec) }
+      }
[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should not fail wit...
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14096#discussion_r7000

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1804,11 +1804,11 @@ test_that("describe() and summarize() on a DataFrame", {
   expect_equal(collect(stats)[2, "age"], "24.5")
   expect_equal(collect(stats)[3, "age"], "7.7781745930520225")
   stats <- describe(df)
-  expect_equal(collect(stats)[4, "name"], "Andy")
+  expect_equal(columns(summary(df)), c("summary", "age"))
   expect_equal(collect(stats)[5, "age"], "30")
   stats2 <- summary(df)
-  expect_equal(collect(stats2)[4, "name"], "Andy")
+  expect_equal(columns(summary(df)), c("summary", "age"))
--- End diff --

And, here, too. I'll fix soon.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14071#discussion_r70004411

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -939,42 +940,33 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
     // to include the partition columns here explicitly
     val schema = cols ++ partitionCols
-    // Storage format
-    val defaultStorage: CatalogStorageFormat = {
-      val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile")
-      val defaultHiveSerde = HiveSerDe.sourceToSerDe(defaultStorageType, conf)
-      CatalogStorageFormat(
-        locationUri = None,
-        inputFormat = defaultHiveSerde.flatMap(_.inputFormat)
-          .orElse(Some("org.apache.hadoop.mapred.TextInputFormat")),
-        outputFormat = defaultHiveSerde.flatMap(_.outputFormat)
-          .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")),
-        // Note: Keep this unspecified because we use the presence of the serde to decide
-        // whether to convert a table created by CTAS to a datasource table.
-        serde = None,
-        compressed = false,
-        serdeProperties = Map())
-    }
     validateRowFormatFileFormat(ctx.rowFormat, ctx.createFileFormat, ctx)
-    val fileStorage = Option(ctx.createFileFormat).map(visitCreateFileFormat)
-      .getOrElse(CatalogStorageFormat.empty)
-    val rowStorage = Option(ctx.rowFormat).map(visitRowFormat)
-      .getOrElse(CatalogStorageFormat.empty)
-    val location = Option(ctx.locationSpec).map(visitLocationSpec)
+    var storage = CatalogStorageFormat(
+      locationUri = Option(ctx.locationSpec).map(visitLocationSpec),
+      provider = Some("hive"),
+      properties = Map.empty)
+    Option(ctx.createFileFormat).foreach(ctx => storage = getFileFormat(ctx, storage))
+    Option(ctx.rowFormat).foreach(ctx => storage = getRowFormat(ctx, storage))
--- End diff --

uh... `defaultStorage.serde` is still `None`. Then, I think it should be fine. Maybe we just need to add a comment to explain how we determine the value of `serde`?
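The pattern under discussion in the diff above (start from a default storage descriptor, then fold each optional clause onto it with `Option.foreach`) can be sketched in isolation. `StorageFormat`, `applyFileFormat`, and `applyRowFormat` below are simplified hypothetical stand-ins for `CatalogStorageFormat` and the parser's `getFileFormat`/`getRowFormat` helpers, not the actual Spark API.

```scala
// Simplified stand-in for CatalogStorageFormat: only the fields needed for the sketch.
case class StorageFormat(
  locationUri: Option[String],
  inputFormat: Option[String],
  serde: Option[String])

// Hypothetical analogues of getFileFormat / getRowFormat: each applies one
// optional clause on top of the storage descriptor built so far.
def applyFileFormat(fmt: String, storage: StorageFormat): StorageFormat =
  storage.copy(inputFormat = Some(fmt))

def applyRowFormat(serdeClass: String, storage: StorageFormat): StorageFormat =
  storage.copy(serde = Some(serdeClass))

// serde deliberately starts as None and stays None unless a ROW FORMAT clause
// is present -- mirroring the note in the diff, where the *presence* of the
// serde later decides whether a CTAS table is converted to a datasource table.
def buildStorage(
    location: Option[String],
    fileFormat: Option[String],
    rowFormat: Option[String]): StorageFormat = {
  var storage = StorageFormat(locationUri = location, inputFormat = None, serde = None)
  fileFormat.foreach(f => storage = applyFileFormat(f, storage))
  rowFormat.foreach(r => storage = applyRowFormat(r, storage))
  storage
}
```

This makes the reviewer's point concrete: with no `ROW FORMAT` clause, `serde` remains `None` all the way through, so downstream code can still use its absence as the conversion signal.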