[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 yes, correct --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136747432

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -569,46 +569,51 @@ class SessionCatalog(
   /**
    * Rename a table.
    *
-   * If a database is specified in `oldName`, this will rename the table in that database.
+   * If the database specified in `newName` is different from the one specified in `oldName`,
+   * It will result in moving table across databases.
+   *
    * If no database is specified, this will first attempt to rename a temporary table with
-   * the same name, then, if that does not exist, rename the table in the current database.
+   * the same name, then, if that does not exist, current database will be used for locating
+   * `oldName` or `newName`.
--- End diff --

So should I leave the comment here unchanged?
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 @gatorsmile I updated; let me know if there are still comments not resolved. Thanks again for the review.
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136719337

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -502,17 +502,16 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val storageWithNewPath = if (rawTable.tableType == MANAGED && hasPathOption) {
       // If it's a managed table with path option and we are renaming it, then the path option
       // becomes inaccurate and we need to update it according to the new table name.
-      val newTablePath = defaultTablePath(TableIdentifier(newName, Some(db)))
+      val newTablePath = defaultTablePath(TableIdentifier(newName, Some(newDb)))
--- End diff --

https://github.com/apache/spark/pull/19086/files#diff-8c4108666a6639034f0ddbfa075bcb37R827 Is this ok?
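The diff above switches the default path computation from the old database to the new one. A minimal sketch of why that matters, assuming for illustration the common "<warehouse>/<db>.db/<table>" layout for managed tables (the helper name here is hypothetical, not Spark's API):

```python
# Hypothetical helper illustrating the "<warehouse>/<db>.db/<table>" layout
# assumed for managed tables. After a cross-database rename, the path must be
# derived from the *new* database, or the path option becomes inaccurate.
def default_table_path(warehouse: str, db: str, table: str) -> str:
    return f"{warehouse}/{db}.db/{table}"

old_path = default_table_path("/warehouse", "dbA", "t1")
new_path = default_table_path("/warehouse", "dbB", "t2")
print(old_path)  # /warehouse/dbA.db/t1
print(new_path)  # /warehouse/dbB.db/t2
```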
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18270 Thanks for the notification. Actually we implement the same logic as Hive, though there's a bug ...
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 Thanks, I will refine soon.
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 @gatorsmile Thanks for taking the time to look at this. I updated the description.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18270 Thank you so much!
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/19086 [SPARK-21874][SQL] Support changing database when rename table.

## What changes were proposed in this pull request?

Support changing the database of a table by `alter table dbA.XXX rename to dbB.YYY`.

## How was this patch tested?

Updated existing unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-21874

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19086.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19086

commit 7865b3dbe5bc4cda263bc75f8c3524c7bb0fe81c
Author: jinxing <jinxing6...@126.com>
Date: 2017-08-30T03:09:57Z

Support changing database when rename table.
[GitHub] spark issue #18713: [SPARK-21509][SQL] Add a config to enable adaptive query...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18713 cc @cenyuhai As we talked offline, you may have interest in this.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18270 @gatorsmile Could you please give some ideas on why the value of `grouping_id()` generated in Spark is different from `grouping__id` in Hive? Is it designed on purpose? A lot of our users are using `grouping__id` in `if(...)` clauses. The incompatibility between Spark and Hive is making our migration very difficult.
[GitHub] spark pull request #18866: [SPARK-21649][SQL] Support writing data into hive...
Github user jinxing64 closed the pull request at: https://github.com/apache/spark/pull/18866
[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18866 @cloud-fan Thanks for the reply. Looks like #19001 continues working on this and it's more comprehensive. I will close this PR for now.
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18270 @cenyuhai Are you still working on this? Could you please fix the test?
[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18866 @cloud-fan Would you give some advice on this? Then I can know whether I'm going in the right direction. I can keep working on it :)
[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18270 Jenkins, retest this please.
[GitHub] spark issue #18866: [WIP][SPARK-21649][SQL] Support writing data into hive b...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18866 In the current change:
1. `ClusteredDistribution` becomes `ClusteredDistribution(clustering: Seq[Expression], clustersOpt: Option[Int] = None, useHiveHash: Boolean = false)` -- a) the number of clusters can be specified, b) `useHiveHash` indicates whether Hive hash should be used for partitioning.
2. `InsertIntoHiveTable` requires the distribution `ClusteredDistribution(partitionAttributes ++ bucketColumns, Option(bucketSpec.numBuckets), useHiveHash = true)`, thus making sure each bucket has only one file.
3. The partition id is not shown in the name of a bucket file, thus only the bucket id decides the order of bucket file names.
4. An empty file is written for an empty bucket.
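The partitioning described above maps each row's hash value to a bucket id with a non-negative modulus (`Pmod`). A minimal Python sketch of that mapping, using a raw integer in place of Spark's actual `HiveHash`/`Murmur3Hash` output (an assumption for illustration):

```python
def pmod(a: int, n: int) -> int:
    # Non-negative modulus, matching the semantics of Spark's Pmod expression:
    # the result is always in [0, n) even when `a` is negative.
    return ((a % n) + n) % n

def bucket_id(hash_value: int, num_buckets: int) -> int:
    # Bucket id is pmod(hash, numBuckets), so every row lands in a valid bucket.
    return pmod(hash_value, num_buckets)

# Even a negative hash value maps into [0, num_buckets):
print(bucket_id(-7, 4))  # 1
print(bucket_id(10, 4))  # 2
```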
[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18866 cc @cloud-fan Would you mind giving some comments? I can keep working on this :)
[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18866 @viirya Please take another look when you have time. I've already updated :)
[GitHub] spark pull request #18866: [SPARK-21649][SQL] Support writing data into hive...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18866#discussion_r131682897

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
@@ -534,4 +534,29 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
       }
     }
   }
+
+  test("Insert data into hive bucketized table.") {
+    sql("""
+      |CREATE TABLE bucketizedTable (key int, value string)
+      |CLUSTERED BY (key) SORTED BY (key ASC) into 4 buckets
+      |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
+      |""".stripMargin)
+    val identifier = spark.sessionState.sqlParser.parseTableIdentifier("bucketizedTable")
+    val data = spark.sparkContext.parallelize((0 until 100)
+      .map(i => TestData(i, i.toString))).toDF()
+    data.write.mode(SaveMode.Overwrite).insertInto("bucketizedTable")
+    val dir = spark.sessionState.catalog.defaultTablePath(identifier)
+    val bucketFiles = new File(dir).listFiles().sortWith((a: File, b: File) => {
+      a.getName < b.getName
+    }).filter(file => file.getName.startsWith("part-"))
+    assert(bucketFiles.length === 4)
+    (0 to 3).foreach { bucket =>
+      spark.read.format("text")
+        .load(bucketFiles(bucket).getAbsolutePath)
+        .collect().map(_.getString(0).split("\t")(0).toInt)
+        .zip((bucket to (100, 4))).foreach { case (a, b) =>
+          assert(a === b)
--- End diff --

Yes, that would be better. I've updated, please take another look : )
[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18866 Jenkins, retest this please.
[GitHub] spark pull request #18866: [SPARK-21649][SQL] Support writing data into hive...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18866#discussion_r131607680

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
@@ -262,7 +262,12 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
   /**
    * Returns an expression that will produce a valid partition ID(i.e. non-negative and is less
    * than numPartitions) based on hashing expressions.
    */
-  def partitionIdExpression: Expression = Pmod(new Murmur3Hash(expressions), Literal(numPartitions))
+  def partitionIdExpression(useHiveHash: Boolean = false): Expression =
+    if (useHiveHash) {
+      Pmod(new HiveHash(expressions), Literal(numPartitions))
--- End diff --

@viirya Thanks a lot for the comment! I compared the code between v0.13 and v1.2.1. In my understanding, there's no compatibility issue. v1.2.1 adds `hashCode` support for more types (`INTERVAL_YEAR_MONTH`, `INTERVAL_DAY_TIME`); for the existing types, they are compatible with each other.
[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18866 I added the unit test referring to (https://github.com/apache/hive/blob/branch-1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/AbstractBucketJoinProc.java#L393). Hive will sort bucket files by file names when doing a sort-merge-bucket (SMB) join.
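Because Hive picks the i-th file (in file-name order) as bucket i, lexical ordering of the written file names must match numeric bucket-id order. A small sketch, assuming a zero-padded `part-<bucketId>` naming convention for illustration (the exact naming scheme is an assumption, not quoted from the PR):

```python
# Zero-padding the bucket id in the file name keeps lexical order equal to
# numeric order, so a name-sorted listing yields bucket 0, 1, 2, 3 in turn.
bucket_ids = (3, 0, 2, 1)  # buckets written in arbitrary order
files = [f"part-{b:05d}" for b in bucket_ids]

print(sorted(files))
# ['part-00000', 'part-00001', 'part-00002', 'part-00003']
```

Without the padding, names like `part-10` would sort before `part-2` and break the bucket-to-file correspondence.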
[GitHub] spark pull request #18866: [SPARK-21649][SQL] Support writing data into hive...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/18866 [SPARK-21649][SQL] Support writing data into hive bucket table.

## What changes were proposed in this pull request?

Support writing to a Hive bucket table. Spark internally uses Murmur3Hash for partitioning; we can use Hive hash for compatibility when writing to a bucket table.

## How was this patch tested?

Unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-21649

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18866.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18866

commit 51d2c110d01b8a4ef1d53d144c443e0e9b43817b
Author: jinxing <jinxing6...@126.com>
Date: 2017-08-07T04:12:56Z

Support writing data into hive bucket table.
[GitHub] spark pull request #18713: [SPARK-21509][SQL] Add a config to enable adaptiv...
Github user jinxing64 closed the pull request at: https://github.com/apache/spark/pull/18713
[GitHub] spark issue #18713: [SPARK-21509][SQL] Add a config to enable adaptive query...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18713 Ok, I will close this for now.
[GitHub] spark issue #18735: [SPARK-21530] Update description of spark.shuffle.maxChu...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18735 @tgravescs Thanks, I should be more careful :)
[GitHub] spark pull request #18735: [SPARK-21530] Update description of spark.shuffle...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18735#discussion_r129608114

--- Diff: docs/configuration.md ---
@@ -636,6 +636,8 @@ Apart from these, the following properties are also available, and may be useful
     Long.MAX_VALUE
     The max number of chunks allowed to being transferred at the same time on shuffle service.
+    Note that new coming connections will be closed when the max number is hit. Client should
--- End diff --

Yes, this is much better.
[GitHub] spark issue #18735: [SPARK-21530] Update description of spark.shuffle.maxChu...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18735 cc @tgravescs @cloud-fan
[GitHub] spark pull request #18735: [SPARK-21530] Update description of spark.shuffle...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/18735 [SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred.

## What changes were proposed in this pull request?

Update the description of `spark.shuffle.maxChunksBeingTransferred` to note that incoming connections will be closed when the max is hit and that clients should have a retry mechanism.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-21530

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18735.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18735

commit ad710dd8efc7bf577c196e95e39a31407d8ff3e7
Author: jinxing <jinxing6...@126.com>
Date: 2017-07-26T00:43:16Z

Update description of spark.shuffle.maxChunksBeingTransferred
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @tgravescs Thanks for the help.

> I think we should expand the description of the config to say what happens when the limit is hit. Since its not using real flow control a user might set this thinking nothing bad will happen, but its dropping connections so could cause failures if the retries don't work.

Could you give the link for the JIRA? I'm happy to work on a follow-up PR if possible. For the flow control part, I'm just worried the queue would grow too large and cause memory issues.
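The trade-off discussed here is rejecting work outright instead of queueing it: when the in-flight chunk count reaches the configured max, a new request is refused (the connection is dropped and the client must retry), so server memory stays bounded. A minimal sketch of that behavior; the class and method names are illustrative only, not Spark's actual API:

```python
# Illustrative-only limiter: reject new transfers once the in-flight count
# reaches the max, rather than buffering them in an unbounded queue.
class ChunkLimiter:
    def __init__(self, max_chunks: int):
        self.max_chunks = max_chunks
        self.in_flight = 0

    def try_open(self) -> bool:
        if self.in_flight >= self.max_chunks:
            return False  # request refused; the client is expected to retry
        self.in_flight += 1
        return True

    def chunk_sent(self) -> None:
        # Called when a chunk finishes transferring, freeing a slot.
        self.in_flight -= 1

limiter = ChunkLimiter(max_chunks=2)
print(limiter.try_open(), limiter.try_open(), limiter.try_open())  # True True False
limiter.chunk_sent()
print(limiter.try_open())  # True
```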
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 Thanks for merging!
[GitHub] spark issue #18713: [SPARK-21509][SQL] Add a config to enable adaptive query...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18713 cc @cloud-fan @jiangxb1987
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 Jenkins, retest this please.
[GitHub] spark pull request #18713: [SPARK-21509][SQL] Add a config to enable adaptiv...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/18713 [SPARK-21509][SQL] Add a config to enable adaptive query execution only for the last que…

## What changes were proposed in this pull request?

Adaptive query execution is a good way to avoid generating too many small files on HDFS, as mentioned in SPARK-16188. When adaptive query execution is enabled, all shuffles will be coordinated. The drawbacks:
1. It's hard to balance the number of reducers (which decides the processing speed) and the file size on HDFS.
2. It generates some more shuffles (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L101).
3. It generates lots of jobs, which have extra cost for scheduling.

We can add a config and enable adaptive query execution only for the last shuffle.

## How was this patch tested?

Unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-21509

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18713.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18713

commit efb8cd36de0e0b96a8efe855b6243c1f553ce195
Author: jinxing <jinxing6...@126.com>
Date: 2017-07-22T04:22:36Z

Add a config to enable adaptive query execution only for the last query execution.
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r128794455

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/StreamManager.java ---

    @@ -83,4 +83,16 @@ public void connectionTerminated(Channel channel) { }
       */
      public void checkAuthorization(TransportClient client, long streamId) { }
    +
    +  /**
    +   * Return the number of chunks being transferred and not finished yet in this StreamManager.
    +   */
    +  public long chunksBeingTransferred() {
    +    return 0;
    +  }
    +
    +  /**
    +   * Called when a chunk is successfully sent.
    +   */
    +  public void chunkSent(long streamId) { }

--- End diff --

Yes, it would be much better.
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @cloud-fan I understand your concern. A `TransportRequestHandler` is for one channel/connection, but we want to track the sending chunks of all connections, so I guess we must have a manager for all the connections. Currently, all chunks are served from `OneForOneStreamManager`, so I put the logic there.
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r128519475

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java ---

    @@ -145,7 +172,12 @@ private void processStreamRequest(final StreamRequest req) {
         }
         if (buf != null) {
    -      respond(new StreamResponse(req.streamId, buf.size(), buf));
    +      respond(new StreamResponse(req.streamId, buf.size(), buf)).addListener(future -> {
    +        if (streamManager instanceof OneForOneStreamManager) {

--- End diff --

The `streamId` sent to `NettyStreamManager` is not a "{streamId}_{chunkIndex}", so it fails to parse?
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r128498015

--- Diff: common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java ---

    @@ -257,4 +257,7 @@ public Properties cryptoConf() {
         return CryptoUtils.toCryptoConf("spark.network.crypto.config.", conf.getAll());
       }
    +
    +  public long maxChunksBeingTransferred() {
    +    return conf.getLong("spark.network.shuffle.maxChunksBeingTransferred", Long.MAX_VALUE);

--- End diff --

It depends on the memory config of the NodeManager. In our cluster, we would set it to 3 GB (memory overhead for Netty) / 1 KB (size of a `ChannelOutboundBuffer$Entry`) = 300. I think it's OK to leave it at `Long.MAX_VALUE` (by default it's disabled).
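The water mark discussed here is just a memory budget divided by a per-entry cost. A back-of-the-envelope sketch of that arithmetic (not Spark code; note that the figures quoted above, 3 GB over ~1 KB entries, actually work out to roughly 3M entries rather than 300, so treat the concrete numbers as illustrative):

```java
public class ChunkWaterMark {
    // maxChunks = memory budget for Netty overhead / cost of one ChannelOutboundBuffer$Entry
    public static long maxChunksBeingTransferred(long nettyOverheadBytes, long entrySizeBytes) {
        return nettyOverheadBytes / entrySizeBytes;
    }

    public static void main(String[] args) {
        long overhead = 3L * 1024 * 1024 * 1024; // 3 GB Netty memory overhead
        long entry = 1024;                       // ~1 KB per ChannelOutboundBuffer$Entry
        System.out.println(maxChunksBeingTransferred(overhead, entry)); // 3145728
    }
}
```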
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r128497296

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java ---

    @@ -145,7 +172,12 @@ private void processStreamRequest(final StreamRequest req) {
         }
         if (buf != null) {
    -      respond(new StreamResponse(req.streamId, buf.size(), buf));
    +      respond(new StreamResponse(req.streamId, buf.size(), buf)).addListener(future -> {
    +        if (streamManager instanceof OneForOneStreamManager) {

--- End diff --

When an executor fetches a file from the driver, it sends a `StreamRequest`. The `NettyStreamManager` on the driver serves the stream.
[GitHub] spark pull request #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame t...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18634#discussion_r128227194

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala ---

    @@ -356,6 +356,46 @@ class SQLWindowFunctionSuite extends QueryTest with SharedSQLContext {
         spark.catalog.dropTempView("nums")
       }
    +
    +  test("window function: mutiple window expressions specified by range in a single expression") {
    +    val nums = sparkContext.parallelize(1 to 10).map(x => (x, x % 2)).toDF("x", "y")
    +    nums.createOrReplaceTempView("nums")
    +    withTempView("nums") {
    +      val expected =
    +        Row(1, 1, 1, 4, null, 8, 25) ::
    +          Row(1, 3, 4, 9, 1, 12, 24) ::
    +          Row(1, 5, 9, 15, 4, 16, 21) ::
    +          Row(1, 7, 16, 21, 8, 9, 16) ::
    +          Row(1, 9, 25, 16, 12, null, 9) ::
    +          Row(0, 2, 2, 6, null, 10, 30) ::
    +          Row(0, 4, 6, 12, 2, 14, 28) ::
    +          Row(0, 6, 12, 18, 6, 18, 24) ::
    +          Row(0, 8, 20, 24, 10, 10, 18) ::
    +          Row(0, 10, 30, 18, 14, null, 10) ::

--- End diff --

`expected` is calculated manually. This test verifies there is no behavior change.
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r128217700

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java ---

    @@ -25,6 +25,9 @@
     import com.google.common.base.Preconditions;
     import io.netty.channel.Channel;
    +import org.apache.commons.lang3.tuple.ImmutablePair;
    +import org.apache.commons.lang3.tuple.Pair;
    +

--- End diff --

Ah, sorry..
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r128217450

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java ---

    @@ -139,6 +153,32 @@ public void checkAuthorization(TransportClient client, long streamId) {
         }
       }
    +
    +  @Override
    +  public void chunkSent(Object streamId) {

--- End diff --

Yes, a `long streamId` will be boxed to `Long`, right?
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r128217046

--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java ---

    @@ -53,9 +56,13 @@
         // that the caller only requests each chunk one at a time, in order.
         int curChunk = 0;
    +
    +    // Used to keep track of the number of chunks being transferred and not finished yet.
    +    AtomicLong chunksBeingTransferred;

--- End diff --

Really? We do `chunksBeingTransferred++` and `chunksBeingTransferred--`? I think that's not safe.
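The concern in this comment is that a plain `long` counter updated with `++`/`--` from multiple Netty handler threads is not thread-safe. A minimal, self-contained sketch (not Spark code) showing how an `AtomicLong` keeps an exact count under concurrent updates:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ChunkCounterSketch {
    // AtomicLong gives lock-free, thread-safe increments and decrements.
    private final AtomicLong chunksBeingTransferred = new AtomicLong(0L);

    public void chunkBeingSent() { chunksBeingTransferred.incrementAndGet(); }
    public void chunkSent()      { chunksBeingTransferred.decrementAndGet(); }
    public long current()        { return chunksBeingTransferred.get(); }

    public static void main(String[] args) throws InterruptedException {
        ChunkCounterSketch c = new ChunkCounterSketch();
        // Two threads each start 100,000 chunk transfers and finish half of them.
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) c.chunkBeingSent();
            for (int i = 0; i < 50_000; i++) c.chunkSent();
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // With a plain long and ++/--, this count could be lost under contention.
        System.out.println(c.current()); // 100000
    }
}
```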
[GitHub] spark pull request #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame t...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18634#discussion_r128215724

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala ---

    @@ -359,37 +359,41 @@ class SQLWindowFunctionSuite extends QueryTest with SharedSQLContext {
       test("window function: mutiple window expressions specified by range in a single expression") {
         val nums = sparkContext.parallelize(1 to 10).map(x => (x, x % 2)).toDF("x", "y")
         nums.createOrReplaceTempView("nums")
    -
    -    val expected =
    -      Row(1, 1, 1, 4, 25) ::
    -        Row(1, 3, 4, 9, 24) ::
    -        Row(1, 5, 9, 15, 21) ::
    -        Row(1, 7, 16, 21, 16) ::
    -        Row(1, 9, 25, 16, 9) ::
    -        Row(0, 2, 2, 6, 30) ::
    -        Row(0, 4, 6, 12, 28) ::
    -        Row(0, 6, 12, 18, 24) ::
    -        Row(0, 8, 20, 24, 18) ::
    -        Row(0, 10, 30, 18, 10) ::
    -        Nil
    -
    -    val actual = sql(
    -      """
    -        |SELECT
    -        |  y,
    -        |  x,
    -        |  sum(x) over w1 as history_sum,
    -        |  sum(x) over w2 as period_sum,
    -        |  sum(x) over w3 as future_sum
    -        |FROM nums
    -        |WINDOW w1 AS (PARTITION BY y ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    -        |       w2 AS (PARTITION BY y ORDER BY x RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING),
    -        |       w3 AS (PARTITION BY y ORDER BY x RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    -      """.stripMargin
    -    )
    -
    -    checkAnswer(actual, expected)
    -    spark.catalog.dropTempView("nums")
    +    withTempView("nums") {
    +      val expected =
    +        Row(1, 1, 1, 4, null, 8, 25) ::

--- End diff --

@cloud-fan This is null; do you think 0 would be better?
[GitHub] spark pull request #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame t...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18634#discussion_r128142152

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala ---

    @@ -356,6 +356,42 @@ class SQLWindowFunctionSuite extends QueryTest with SharedSQLContext {
         spark.catalog.dropTempView("nums")
       }
    +
    +  test("window function: mutiple window expressions specified by range in a single expression") {
    +    val nums = sparkContext.parallelize(1 to 10).map(x => (x, x % 2)).toDF("x", "y")
    +    nums.createOrReplaceTempView("nums")

--- End diff --

Sure, I will add it later today.
[GitHub] spark issue #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18634 @cloud-fan @jiangxb1987 Thanks for the help! I will refine and post the result of the manual test later today :)
[GitHub] spark issue #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18634 @jiangxb1987 Thanks a lot for the quick reply!

> One concern is the test case don't reflect the improvement of this change.

Yes, there is no unit test for `WindowFunctionFrame` now. The test case I added just checks correctness.
[GitHub] spark issue #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18634 cc @cloud-fan @jiangxb1987
[GitHub] spark issue #18388: [SPARK-21175][WIP] Reject OpenBlocks when memory shortag...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @tgravescs Thanks a lot for helping with this PR! I've updated it. In the current change, the shuffle server tracks the number of chunks being transferred, and the connection is closed when that number rises above the water mark. I think it makes sense -- the shuffle server should close the connection rather than OOM when memory pressure is high. The flow-control part is not included in the current change, because I don't have a proper way to balance buffering requests against caching `ChannelOutboundBuffer$Entry`s.

> You mentioned a problem with just doing a close, what was the problem?

I just think the user should know there's another condition that can trigger closing the connection. Also, reconnection has some cost.
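The "close rather than OOM" behavior described in this comment can be sketched as a guard in the chunk-fetch path. The class and method names below are illustrative, not Spark's actual API, and the single-threaded counter stands in for the atomic one the real server would need:

```java
public class FetchGuardSketch {
    private final long maxChunksBeingTransferred;
    private long chunksBeingTransferred = 0; // single-threaded sketch; a real server needs an AtomicLong

    public FetchGuardSketch(long maxChunksBeingTransferred) {
        this.maxChunksBeingTransferred = maxChunksBeingTransferred;
    }

    /** Returns true if the fetch is admitted; false means "close the connection". */
    public boolean tryAdmitChunkFetch() {
        if (chunksBeingTransferred >= maxChunksBeingTransferred) {
            return false; // over the water mark: reject instead of buffering and risking OOM
        }
        chunksBeingTransferred++;
        return true;
    }

    /** Called when a chunk finishes sending, freeing one slot. */
    public void chunkSent() {
        chunksBeingTransferred--;
    }
}
```

With a water mark of 2, the third concurrent fetch is rejected until one of the first two completes.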
[GitHub] spark issue #18487: [SPARK-21243][Core] Limit no. of map outputs in a shuffl...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18487 `maxReqsInFlight` and `maxBytesInFlight` make it hard to control the number of blocks in a single request. When the number of maps is very high, this change can alleviate the pressure on the shuffle server. @dhruve what is a proper value for `maxBlocksInFlightPerAddress`? I like this PR if there is no performance issue. It would be great if you could post some benchmarks.
[GitHub] spark issue #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18634 Jenkins, retest this please.
[GitHub] spark issue #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18634 retest please
[GitHub] spark pull request #18634: [SPARK-21414] Refine SlidingWindowFunctionFrame t...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/18634 [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM.

## What changes were proposed in this pull request?

In `SlidingWindowFunctionFrame`, we currently add to the buffer all rows whose input value is equal to or less than the output row's upper bound, then drop from the buffer all rows whose input value is smaller than the output row's lower bound. This can make the buffer very big even though the window is small. For example:

```
select a, b, sum(a)
over (partition by b order by a range between 100 following and 101 following)
from table
```

We can refine the logic to add only the qualifying rows to the buffer.

## How was this patch tested?

Added test.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/jinxing64/spark SPARK-21414 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18634.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18634 commit 5103ae80d456ca26c7b853f99e73c4a2c152ad41 Author: jinxing <jinxing6...@126.com> Date: 2017-07-12T12:02:28Z Refine SlidingWindowFunctionFrame to avoid OOM.
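The buffering issue this PR describes can be sketched outside Spark. For each output row with ordering value `v`, a frame like `RANGE BETWEEN 1 FOLLOWING AND 2 FOLLOWING` covers input values in `[v + 1, v + 2]`; the refinement is to admit into the buffer only rows actually inside the current frame, instead of everything up to the upper bound. A minimal standalone sketch over a sorted int column (hypothetical helper, not Spark's `SlidingWindowFunctionFrame`):

```java
import java.util.ArrayDeque;

public class SlidingFrameSketch {
    // For each value v in the sorted input, sum the input values inside [v + lo, v + hi].
    public static long[] frameSums(int[] sorted, int lo, int hi) {
        long[] out = new long[sorted.length];
        ArrayDeque<Integer> buffer = new ArrayDeque<>();
        long sum = 0;
        int next = 0; // next input row not yet examined
        for (int i = 0; i < sorted.length; i++) {
            int lower = sorted[i] + lo;
            int upper = sorted[i] + hi;
            // Evict rows that have fallen below this frame's lower bound.
            while (!buffer.isEmpty() && buffer.peekFirst() < lower) {
                sum -= buffer.removeFirst();
            }
            // Admit only rows inside the current frame (the refinement): rows that
            // fall between consecutive frames never enter the buffer at all.
            while (next < sorted.length && sorted[next] <= upper) {
                if (sorted[next] >= lower) {
                    buffer.addLast(sorted[next]);
                    sum += sorted[next];
                }
                next++;
            }
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // RANGE BETWEEN 1 FOLLOWING AND 2 FOLLOWING over values 1..5
        System.out.println(java.util.Arrays.toString(frameSums(new int[]{1, 2, 3, 4, 5}, 1, 2)));
        // [5, 7, 9, 5, 0]
    }
}
```

Because the input is sorted and the bounds only move forward, a row skipped as "below the lower bound" can never qualify for a later frame, so the buffer holds at most one frame's worth of rows.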
[GitHub] spark issue #18388: [SPARK-21175][WIP] Reject OpenBlocks when memory shortag...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @tgravescs Thanks a lot for the advice.

> the flow control part should allow everyone to start fetching without rejecting a bunch, especially if the network can't push it out that fast anyway. For instance only create a handful of those outgoing buffers and wait to get successfully sent messages back for the those before creating more.

Regarding flow control, can I ask a question? With flow control, we won't reject `OpenBlocks`, but do we reject `ChunkFetchRequest`? If we don't reject `ChunkFetchRequest`, it means we accept all `ChunkFetchRequest`s, and we must store them in some buffer when memory pressure from Netty is above the water mark. But buffering the `ChunkFetchRequest`s can also take a lot of memory -- that would be another issue. The problem is that the client never knows when to slow down sending requests. Am I wrong?
[GitHub] spark issue #18388: [SPARK-21175][WIP] Reject OpenBlocks when memory shortag...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 I think it could be more efficient to do the control on the shuffle service side.
[GitHub] spark issue #18388: [SPARK-21175][WIP] Reject OpenBlocks when memory shortag...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 Previously I said that I had 200k+ connections to one shuffle service. I'm sorry -- that information was wrong. It turns out that each of our `NodeManager`s has two auxiliary shuffle services, one for Spark and one for Hadoop MapReduce, and most of the connections are for the Hadoop MapReduce shuffle service. Analyzing the heap dump, there are only 761 connections (`NioSocketChannel`) to the Spark shuffle service. (How did I find this? The Spark shuffle service uses Netty 4 for transferring blocks. I found tons of `org.jboss.netty.channel.socket.nio.NioAcceptedSocketChannel`; checking the Netty code, those are only used in Netty 3, which is what our Hadoop uses.)

So @zsxwing, there is no connection leak in my understanding. The situation is that we had 10K map tasks running on the same shuffle service and around 1K reduce tasks fetching the blocks. On the Java heap I found that one `io.netty.channel.ChannelOutboundBuffer.Entry` (referencing one block) costs almost 1 KB, and we had 3.5M entries. At the time of the OOM, the `io.netty.channel.ChannelOutboundBuffer.Entry`s were retaining 3 GB.

So the problem here is that one connection is fetching too many blocks. I believe tuning `spark.reducer.maxReqsInFlight` or `spark.reducer.maxBlocksInFlightPerAddress` (#18487) can alleviate this issue. The question is how to set it appropriately. That seems hard, because we need to balance warehouse performance against stability. After all, only 2~3 NodeManagers ran out of memory; we cannot set `spark.reducer.maxReqsInFlight` too small without risking performance degradation.

I checked the connections of one shuffle service yesterday. 5K connections is very common during the night. It's easy to happen -- say there are 5K reduces running at the same time. What if there are 5 applications and each has 5K reduces? That would be 25K connections. If each connection is fetching 100 blocks and each `Entry` is 1 KB, the memory cost is 2.5 GB. I think that's too much. So I'm still proposing concurrency control. Different from the current change, can we control the number of blocks being transferred? If the number is above the water mark, we can fail the incoming `OpenBlocks`.
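The memory estimate in the comment above is a straight product of connections, blocks per connection, and per-entry cost. A sketch of that arithmetic (the numbers are the ones quoted in the discussion, not measurements of any particular cluster):

```java
public class EntryCostSketch {
    // Total bytes retained by ChannelOutboundBuffer entries across all connections.
    public static long bytes(long connections, long blocksPerConnection, long entryBytes) {
        return connections * blocksPerConnection * entryBytes;
    }

    public static void main(String[] args) {
        // 25K connections x 100 blocks each x ~1 KB per Entry ≈ 2.5 GB
        System.out.println(bytes(25_000, 100, 1024)); // 2560000000
    }
}
```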
[GitHub] spark issue #18388: [SPARK-21175][WIP] Reject OpenBlocks when memory shortag...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388

> Is this all a single application?

No, it's a data warehouse; there are thousands of ETLs.

> You say 6000 nodes with 64 executors on each host, how many cores per executor?

1 core per executor.

> have you tried using spark.reducer.maxReqsInFlight

I think we didn't.
[GitHub] spark pull request #18405: [SPARK-21194][SQL] Fail the putNullmethod when co...
Github user jinxing64 closed the pull request at: https://github.com/apache/spark/pull/18405
[GitHub] spark issue #18405: [SPARK-21194][SQL] Fail the putNullmethod when containsN...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18405 Sure, I will close this pr for now :)
[GitHub] spark pull request #18405: [SPARK-21194][SQL] Fail the putNullmethod when co...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18405#discussion_r126584992

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala ---

    @@ -758,6 +758,35 @@ class ColumnarBatchSuite extends SparkFunSuite {
         }}
       }
    +
    +  test("Putting null should fail when null is forbidden in array.") {
    +    (MemoryMode.ON_HEAP :: MemoryMode.OFF_HEAP :: Nil).foreach { memMode =>
    +      val column = ColumnVector.allocate(10, new ArrayType(IntegerType, false), memMode)
    +      val data = column.arrayData();
    +      data.putInt(0, 0)
    +      data.putInt(1, 1)
    +      assert(data.getInt(0) === 0)
    +      assert(data.getInt(1) === 1)
    +      val ex = intercept[RuntimeException] {
    +        data.putNull(2)

--- End diff --

Sure, did it.
[GitHub] spark issue #18593: [SPARK-21369][Core]Don't use Scala Tuple2 in common/netw...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18593 LGTM
[GitHub] spark issue #18405: [SPARK-21194][SQL] Fail the putNullmethod when containsN...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18405 @cloud-fan Any more comments on this :) ? If this is too trivial, should I close it?
[GitHub] spark issue #18388: [SPARK-21175][WIP] Reject OpenBlocks when memory shortag...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @tgravescs Thanks a lot for your advice :) very helpful. I will try more on this.
[GitHub] spark issue #18541: [SPARK-21315][SQL]Skip some spill files when generateIte...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18541 @jiangxb1987 Thanks for approving! I've already updated the PR description.
[GitHub] spark issue #18565: [SPARK-21342] Fix DownloadCallback to work well with Ret...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18565 Jenkins, retest this please
[GitHub] spark pull request #18541: [SPARK-21315][SQL]Skip some spill files when gene...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18541#discussion_r126342950 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java --- @@ -589,29 +589,51 @@ public long getKeyPrefix() { } /** - * Returns a iterator, which will return the rows in the order as inserted. + * Returns an iterator starts from startIndex, which will return the rows in the order as + * inserted. * * It is the caller's responsibility to call `cleanupResources()` * after consuming this iterator. * * TODO: support forced spilling */ - public UnsafeSorterIterator getIterator() throws IOException { + public UnsafeSorterIterator getIterator(int startIndex) throws IOException { if (spillWriters.isEmpty()) { assert(inMemSorter != null); - return inMemSorter.getSortedIterator(); + UnsafeSorterIterator iter = inMemSorter.getSortedIterator(); + moveOver(iter, startIndex); + return iter; } else { LinkedList queue = new LinkedList<>(); + int i = 0; for (UnsafeSorterSpillWriter spillWriter : spillWriters) { -queue.add(spillWriter.getReader(serializerManager)); +if (i + spillWriter.recordsSpilled() <= startIndex) { + i += spillWriter.recordsSpilled(); --- End diff -- Sure
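The skipping logic in the diff above can be sketched in isolation. Names here are illustrative, not Spark's API: given how many records each spill file holds, any file that ends at or before `startIndex` is skipped whole, and only the remainder must be advanced inside the first useful file.

```java
// Standalone sketch of the spill-file skipping idea discussed in the diff.
class SpillSkipSketch {

    /**
     * Returns {indexOfFirstUsefulSpillFile, recordsStillToSkipInsideIt}.
     */
    static int[] locate(int[] recordsPerSpillFile, int startIndex) {
        int fileIdx = 0;
        int skipped = 0;
        // Skip whole spill files that lie entirely before startIndex.
        while (fileIdx < recordsPerSpillFile.length
                && skipped + recordsPerSpillFile[fileIdx] <= startIndex) {
            skipped += recordsPerSpillFile[fileIdx];
            fileIdx++;
        }
        return new int[] { fileIdx, startIndex - skipped };
    }
}
```

For example, with two 100-row spill files, asking for index 201 skips both files entirely and leaves one record to move past in the in-memory sorter.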
[GitHub] spark pull request #18565: [SPARK-21342] Fix DownloadCallback to work well w...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18565#discussion_r126306193 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java --- @@ -119,9 +132,9 @@ public void onSuccess(ByteBuffer response) { // Immediately request all chunks -- we expect that the total size of the request is // reasonable due to higher level chunking in [[ShuffleBlockFetcherIterator]]. for (int i = 0; i < streamHandle.numChunks; i++) { -if (shuffleFiles != null) { +if (toDisk) { --- End diff -- Good idea.
[GitHub] spark pull request #18565: [SPARK-21342] Fix DownloadCallback to work well w...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18565#discussion_r126306207 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java --- @@ -58,29 +59,41 @@ private final BlockFetchingListener listener; private final ChunkReceivedCallback chunkCallback; private TransportConf transportConf = null; - private File[] shuffleFiles = null; + private Boolean toDisk; + private Supplier tmpFileCreater; + private Supplier shuffleBlockFetcherIteratorIsZombie; private StreamHandle streamHandle = null; public OneForOneBlockFetcher( +TransportClient client, +String appId, +String execId, +String[] blockIds, +BlockFetchingListener listener, +TransportConf transportConf) { +this(client, appId, execId, blockIds, listener, transportConf, false, null, null); + } + + public OneForOneBlockFetcher( TransportClient client, String appId, String execId, String[] blockIds, BlockFetchingListener listener, TransportConf transportConf, - File[] shuffleFiles) { + Boolean toDisk, + Supplier tmpFileCreater, + Supplier shuffleBlockFetcherIteratorIsZombie) { --- End diff -- I will refine.
[GitHub] spark issue #18541: [SPARK-21315][SQL]Skip some spill files when generateIte...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18541 @cloud-fan Would you mind taking a look when you have time? :)
[GitHub] spark pull request #18541: [SPARK-21315][SQL]Skip some spill files when gene...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18541#discussion_r126299265 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java --- @@ -589,29 +589,50 @@ public long getKeyPrefix() { } /** - * Returns a iterator, which will return the rows in the order as inserted. + * Returns a iterator starts from startIndex, which will return the rows in the order as inserted. --- End diff -- Sure thing.
[GitHub] spark issue #18565: [SPARK-21342] Fix DownloadCallback to work well with Ret...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18565 In the current change: 1. I'm using `java.util.function.Supplier` instead of `TmpFileCreater`; 2. `shuffleBlockFetcherIteratorIsZombie` is passed from `ShuffleBlockFetcherIterator` to `OneForOneBlockFetcher`; 3. Successfully downloaded files will be cleaned up in `ShuffleBlockFetcherIterator`, while the others will be deleted in `OneForOneBlockFetcher`.
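A minimal sketch of the `java.util.function.Supplier`-based approach mentioned in point 1. `tmpFileCreator` and the file-name prefix are assumptions for illustration, not the PR's actual code:

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical factory: the fetcher receives a Supplier<File> and only
// materializes a temp file when a download actually starts, so callers
// that never spill to disk never create a file.
class TmpFileSupplierSketch {
    static Supplier<File> tmpFileCreator(File dir) {
        return () -> {
            try {
                // Unique name per call, so concurrent downloads never collide.
                return File.createTempFile("shuffle-fetch-", ".tmp", dir);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }
}
```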
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 Thanks for the reply. I will figure out what I can do for this :)
[GitHub] spark issue #18565: [SPARK-21342] Fix DownloadCallback to work well with Ret...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18565 There were some corner cases because files were created and deleted in both `ShuffleBlockFetcherIterator` and `OneForOneBlockFetcher`. It seems there is no need to pass the `shuffleFiles` from `ShuffleBlockFetcherIterator` to `ShuffleClient` and `OneForOneBlockFetcher`. Now I have changed it so that files are created and deleted only in `OneForOneBlockFetcher`. We don't do renaming: every `DownloadCallback` has its own target file, and all target files are deleted when the blockManager stops.
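The bookkeeping described above can be sketched as follows; `DownloadFileTracker` and its method names are hypothetical, not the PR's actual classes:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Each download attempt gets its own file (so a retry never clobbers a
// file another DownloadCallback is still writing), and everything still
// on disk is removed in one sweep at shutdown.
class DownloadFileTracker {
    private final List<File> files = new ArrayList<>();
    private final File dir;

    DownloadFileTracker(File dir) {
        this.dir = dir;
    }

    /** Fresh file per download attempt; unique name avoids collisions. */
    synchronized File newTargetFile() throws IOException {
        File f = File.createTempFile("fetch-", ".data", dir);
        files.add(f);
        return f;
    }

    /** Delete every tracked file, e.g. when the block manager stops. */
    synchronized void stop() {
        for (File f : files) {
            f.delete();
        }
        files.clear();
    }
}
```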
[GitHub] spark pull request #18565: [SPARK-21342] Fix DownloadCallback to work well w...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18565#discussion_r126277863 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java --- @@ -151,15 +152,27 @@ private void failRemainingBlocks(String[] failedBlockIds, Throwable e) { } } + private static synchronized boolean renameFile(File src, File dest) { +if (dest.exists()) { + if (!dest.delete()) { --- End diff -- I was using `synchronized static` because there could be multiple `OneForOneBlockFetcher`s doing the rename work.
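For comparison, one way to get overwrite-on-rename without a static lock (within a single JVM, and assuming no external process races on the same files) is `java.nio.file.Files.move` with `REPLACE_EXISTING`. This is a sketch of an alternative, not what the PR does:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Single-call replacement for the delete-then-rename pair in the diff:
// REPLACE_EXISTING makes overwriting the destination part of the move itself.
class MoveSketch {
    static void replace(Path src, Path dest) throws IOException {
        Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING);
    }
}
```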
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @tgravescs Thanks a lot for reviewing this PR so thoroughly. I think I'm making a stupid mistake. Can I ask a question: how did you decide the number of connections? I'm just counting the `org.jboss.netty.channel.socket.nio.NioAcceptedSocketChannel` instances in the heap dump. To be honest, 200k is also beyond my imagination.
[GitHub] spark issue #18566: Refine the document for spark.reducer.maxReqSizeShuffleT...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18566 I didn't include this config in configuration.md. Do I need to? cc @zsxwing @cloud-fan
[GitHub] spark pull request #18566: Refine the document for spark.reducer.maxReqSizeS...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/18566 Refine the document for spark.reducer.maxReqSizeShuffleToMem. ## What changes were proposed in this pull request? In the current code, the reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine the document. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jinxing64/spark SPARK-21343 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18566.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18566 commit 1fd9403d13f4efa31cedc81eafaf0271bc5843db Author: jinxing <jinxing6...@126.com> Date: 2017-07-07T15:56:15Z Refine the document for spark.reducer.maxReqSizeShuffleToMem.
[GitHub] spark pull request #18565: [SPARK-21342] Fix DownloadCallback to work well w...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18565#discussion_r126178968 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java --- @@ -151,15 +152,27 @@ private void failRemainingBlocks(String[] failedBlockIds, Throwable e) { } } + private static synchronized boolean renameFile(File src, File dest) { +if (dest.exists()) { + if (!dest.delete()) { +return false; + } +} +return src.renameTo(dest); + } + private class DownloadCallback implements StreamCallback { private WritableByteChannel channel = null; private File targetFile = null; +private File tmpFile = null; private int chunkIndex; -DownloadCallback(File targetFile, int chunkIndex) throws IOException { - this.targetFile = targetFile; - this.channel = Channels.newChannel(new FileOutputStream(targetFile)); +DownloadCallback(int chunkIndex) throws IOException { --- End diff -- This is unrelated; it just removes a redundant param.
[GitHub] spark issue #18565: [SPARK-21342] Fix DownloadCallback to work well with Ret...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18565 cc @zsxwing @cloud-fan @jiangxb1987
[GitHub] spark pull request #18565: [SPARK-21342] Fix DownloadCallback to work well w...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/18565 [SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFetcher. ## What changes were proposed in this pull request? When `RetryingBlockFetcher` retries fetching blocks, there could be two `DownloadCallback`s downloading the same content to the same target file, which could cause `ShuffleBlockFetcherIterator` to read a partial result. This PR proposes to write to a tmp file and rename it to the target file when finished. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jinxing64/spark SPARK-21342 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18565.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18565 commit da65427a94b24b2ad0a6ecf79ac1980dd458c684 Author: jinxing <jinxing6...@126.com> Date: 2017-07-07T15:29:58Z Fix DownloadCallback to work well with RetryingBlockFetcher.
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 We didn't change `spark.shuffle.io.numConnectionsPerPeer`. Our biggest cluster has 6000 `NodeManager`s, and there can be 50 executors running on the same host at the same time.
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @cloud-fan To be honest, it's a little bit tricky to reject "open blocks" by closing the connection: the subsequent reconnection will surely have extra cost. The current change relies on the retry mechanism of `RetryingBlockFetcher`. `spark.shuffle.io.maxRetries` and `spark.shuffle.io.retryWait` should also be tuned; with this change their meanings may become different, and users should know this. This is the sacrifice for compatibility. It occurs to me that we could add back `OpenBlocksFailed` behind a flag (default false). If users want to turn it on, we can tell them they should upgrade the client first.
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 Analyzing the heap dump, there are 200K+ connections and 3.5M blocks (`FileSegmentManagedBuffer`) being fetched. Yes, flow control is a good idea, but I still think it makes sense to control the concurrency. By rejecting some "open blocks" requests, we keep sufficient bandwidth for the existing connections and can finish the reduce tasks as soon as possible. Simple flow control (slowing down connections under pressure) can help avoid OOM, but it seems more reduce tasks would run longer.
[GitHub] spark pull request #18482: [SPARK-21262] Stop sending 'stream request' when ...
Github user jinxing64 closed the pull request at: https://github.com/apache/spark/pull/18482
[GitHub] spark issue #18482: [SPARK-21262] Stop sending 'stream request' when shuffle...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18482 Sure, I will update the document soon.
[GitHub] spark pull request #18487: [SPARK-21243][Core] Limit no. of map outputs in a...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18487#discussion_r125917832 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -321,6 +321,16 @@ package object config { .intConf .createWithDefault(3) + private[spark] val REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS = +ConfigBuilder("spark.reducer.maxBlocksInFlightPerAddress") --- End diff -- When the shuffle service gets OOM, there are always lots of (thousands of) reducers (maybe from different apps) fetching blocks. I'm not sure it will help much to limit in-flight blocks per reducer. Also, we already have maxReqsInFlight and maxBytesInFlight; is it a little bit redundant to also have maxBlocksInFlightPerAddress? Sorry for this comment.
[GitHub] spark issue #18541: [SPARK-21315][SQL]Skip some spill files when generateIte...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18541 @hvanhovell I refined it according to your comments. Please take another look when you have time :)
[GitHub] spark issue #18541: [SPARK-21315][SQL]Skip some spill files when generateIte...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18541 Did a small test: 2000200 rows in the `UnsafeExternalSorter`: 2 spill files (each contains 100 rows) and `inMemSorter` contains 200 rows. I want to target the iterator to index=201. With this change: `getIterator(201)` costs almost 0ms~1ms; without this change: `for(int i=0; i<201; i++)geIterator().loadNext()` costs 300ms.
[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/16989#discussion_r125877439 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java --- @@ -126,4 +150,38 @@ private void failRemainingBlocks(String[] failedBlockIds, Throwable e) { } } } + + private class DownloadCallback implements StreamCallback { + +private WritableByteChannel channel = null; +private File targetFile = null; +private int chunkIndex; + +public DownloadCallback(File targetFile, int chunkIndex) throws IOException { + this.targetFile = targetFile; + this.channel = Channels.newChannel(new FileOutputStream(targetFile)); --- End diff -- @zsxwing Sorry, I just realized this issue. There can be a conflict between two `DownloadCallback`s. I will figure out a way to resolve this.
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 I removed the `OpenBlocksFailed` for compatibility. In the current change, the server rejects the "open blocks" request by closing the connection. Then `RetryingBlockFetcher` will retry.
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @tgravescs As in the screenshot, we have tons of `ChannelOutboundBuffer$Entry` instances. Yes, we are using `transferTo`. Netty will put the `Entry` (containing a reference to the `MessageWithHeader`) into the `ChannelOutboundBuffer`, then the consumer will ship the data onto the network. We are running OOM because of too many `ChannelOutboundBuffer$Entry` instances: as you can see, almost 3GB.
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 @tgravescs I think it's not that harmful. In the current change, the new client is compatible with both the old and the new shuffle service. In our clusters, we always upgrade the client first and then the server side, which will not cause incompatibility issues. The only risk here is that a user upgrades the server but is still using the old client, but I find no reason they would do this. I think users usually tend to upgrade the client first and then deploy new servers gradually. In our cluster, NodeManagers fail every day because of OOM in the shuffle service. The root cause is that the shuffle service is a hot spot and there is no concurrency control.
[GitHub] spark pull request #18541: [SPARK-21315][SQL]Skip some spill files when gene...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/18541 [SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray. ## What changes were proposed in this pull request? In the current code, it is expensive to use `UnboundedFollowingWindowFunctionFrame`, because it iterates from the start to the lower bound every time the `write` method is called. When traversing the iterator, it's possible to skip some spilled files and thus save some time. ## How was this patch tested? Added unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jinxing64/spark SPARK-21315 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18541.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18541 commit 442abb32cc08a3311ce6540e7e40c4b62e51a0c1 Author: jinxing <jinxing6...@126.com> Date: 2017-07-05T06:22:19Z Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray.
[GitHub] spark issue #18482: [SPARK-21262] Stop sending 'stream request' when shuffle...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18482 Very gentle ping @zsxwing, what do you think about this idea?
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 Very gentle ping @zsxwing, would you mind commenting on this?
[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18388 Yes, there is a change. The server side may return `OpenBlocksFailed` for the "open blocks" request, which means an old client is not compatible with the new server. Is that acceptable?
[GitHub] spark issue #18482: [SPARK-21262] Stop sending 'stream request' when shuffle...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18482 In the current change, the client fetches the big chunk into memory, writes it to disk, and then releases the memory. I made this change for the reasons below: 1. The client shouldn't break the old shuffle service, so it cannot send a "stream request" to the server. We have to send a `ChunkFetchRequest` and handle the `ChunkFetchSuccess` response. 2. It's hard to make `ChunkFetchSuccess` a stream and read it to disk. We would need to implement another `TransportFrameDecoder`, which I think is too costly.
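The fetch-then-spill approach described above can be sketched as follows. This is a hedged illustration only: the class name `ChunkToDisk` and the method `spillChunk` are hypothetical, not Spark's API. It shows the sequence the comment describes: the whole chunk arrives in memory first, is written to a local file, and the in-memory buffer is then released so it no longer pins heap space.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch: persist a fully-fetched chunk to disk so the
// in-memory copy can be released afterwards.
public class ChunkToDisk {

    /** Writes a fully-fetched chunk to disk and returns the bytes written. */
    static long spillChunk(byte[] chunk, File target) throws IOException {
        try (FileOutputStream out = new FileOutputStream(target)) {
            out.write(chunk);  // persist the fetched bytes
        }
        // In Netty-backed code the ByteBuf would be release()d here; with a
        // plain array the reference simply becomes garbage-collectable.
        return target.length();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("chunk", ".bin");
        tmp.deleteOnExit();
        System.out.println(spillChunk(new byte[1024], tmp));
    }
}
```

The trade-off versus true streaming is that peak memory is still bounded by the chunk size, but no new frame decoder is needed on the client.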
[GitHub] spark pull request #18388: [SPARK-21175] Reject OpenBlocks when memory short...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18388#discussion_r125068210 --- Diff: common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java --- @@ -257,4 +257,31 @@ public Properties cryptoConf() { return CryptoUtils.toCryptoConf("spark.network.crypto.config.", conf.getAll()); } + /** + * When memory usage of Netty is above this water mark, it's regarded as memory shortage. --- End diff -- @cloud-fan I filed a JIRA (https://issues.apache.org/jira/browse/SPARK-21270) about merging the memory configs. Please take a look and comment when you have time.
[GitHub] spark issue #18482: [SPARK-21262][WIP] Stop sending 'stream request' when sh...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18482 cc @zsxwing @cloud-fan @jiangxb1987