[GitHub] spark pull request #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessar...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/20078 [SPARK-22900] [Spark-Streaming] Remove unnecessary restrict for streaming dynamic allocation ## What changes were proposed in this pull request? When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the conf `num-executors` cannot be set. As a result, the application allocates the default 2 executors and all receivers run on these 2 executors, so there may be no spare CPU cores for tasks and the job gets stuck indefinitely. In my opinion, we should remove this unnecessary restriction on streaming dynamic allocation, so that `num-executors` and `spark.streaming.dynamicAllocation.enabled=true` can be set together. When the application starts, each receiver will run on an executor. ## How was this patch tested? Manual test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20078.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20078 commit 6a7d07b7f135ed8ad079a1918fe3484757960df0 Author: sharkdtu Date: 2017-12-25T13:13:16Z remove unnecessary restrict for streaming dynamic allocation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
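Under the proposed change, both settings could be supplied together; a hypothetical `spark-defaults.conf` fragment (values are illustrative; `--num-executors` corresponds to `spark.executor.instances`):

```
spark.executor.instances                   10
spark.streaming.dynamicAllocation.enabled  true
```

With an initial pool of 10 executors, each receiver can be placed on its own executor while still leaving cores free for batch tasks.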
[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/20078 @felixcheung At the beginning, if numReceivers > totalExecutorCores, there are no CPU cores left for batch processing, and `ExecutorAllocationManager` can't receive metrics for any batches. As a result, it doesn't work.
[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/20078 @felixcheung If you submit Spark on YARN with `spark.streaming.dynamicAllocation.enabled=true`, `num-executors` cannot be set. So, at the beginning, there are only 2 executors (the default).
[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/20078 @jerryshao If this PR can fix the bugs you mentioned, why not fix them? Otherwise, the feature should be marked as deprecated.
[GitHub] spark issue #20078: [SPARK-22900] [Spark-Streaming] Remove unnecessary restr...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/20078 @felixcheung Have you thought about the initial num-executors? It defaults to 2 executors when you run Spark on YARN. How can you make sure these 2 executors have enough cores for the receivers at the beginning?
[GitHub] spark pull request #14479: [SPARK-16873] [Core] Fix SpillReader NPE when spi...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/14479 [SPARK-16873] [Core] Fix SpillReader NPE when spillFile has no data ## What changes were proposed in this pull request? `SpillReader` throws an NPE when the spill file has no data. See the following logs:

```
16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77-565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, fileSize:0.0 B
16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from org.apache.spark.util.collection.ExternalSorter@3db4b52d
16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size = 190458101 bytes, TID = 23585
16/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage 18.0 (TID 23585)
java.lang.NullPointerException
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:624)
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:539)
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.<init>(ExternalSorter.scala:507)
    at org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:816)
    at org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:251)
    at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
16/07/31 20:54:30 INFO executor.Executor: Executor is trying to kill task 1090.1 in stage 18.0 (TID 23793)
16/07/31 20:54:30 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
```

## How was this patch tested? Manual test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14479.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14479 commit d8cf2b493a589b745d54b3b903848d4d0827e642 Author: sharkd Date: 2016-07-12T23:59:26Z rebase apache/master commit 8b0c40ab555336899b684fc2a1d6cc1c0886cd11 Author: sharkd Date: 2016-07-11T16:49:56Z fix style commit 888cf1fa2187e4f92286c74ba6a05196348eff79 Author: sharkd Date: 2016-07-12T23:59:26Z rebase apache/master commit c470ab74b1bfc4814f0ca683102ed55b6c2a1410 Author: sharkd Date: 2016-07-11T16:49:56Z fix style commit 8ae5ec71c9e12b4004d0563c9b581b590890369f Author: sharkdtu Date: 2016-08-03T11:51:45Z SpillReader NPE when spillFile has no data --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
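The stack trace suggests cleanup dereferenced a stream that was never opened because the spill file was empty. A minimal, hypothetical sketch of a null-safe cleanup path (not the actual Spark source; names are illustrative):

```scala
import java.io.Closeable

// Hypothetical guard: a zero-length spill file never opens a batch stream,
// so cleanup must handle the "no stream" case instead of throwing an NPE.
def safeCleanup(stream: Option[Closeable]): String = stream match {
  case Some(s) => s.close(); "closed"
  case None    => "nothing to close" // previously dereferenced null here
}
```

The same shape applies to the real fix: treat the deserialize stream as optional until the first non-empty batch is read.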
[GitHub] spark pull request #17963: [SPARK-20722][Core][History Server] Replay newer ...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/17963 [SPARK-20722][Core][History Server] Replay newer event log that hasn't been replayed in advance for request ## What changes were proposed in this pull request? The history server may replay logs slowly if the size of the event logs in the current checking period is very large. It can be stuck for a while before entering the next checking period; if we request a newer application's history UI during that time, we get an error like "Application application_1481785469354_934016 not found". We can let the history server replay the newer event log in advance for the request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17963.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17963 commit 3005c7cd7c57fd0c6a0ea318760dc2dc3010e3aa Author: sharkdtu Date: 2017-05-12T07:50:44Z Replay event log that hasn't be replayed in current checking period in advance for request
[GitHub] spark pull request #16912: [SPARK-19576] [Core] Task attempt paths exist in ...
Github user sharkdtu closed the pull request at: https://github.com/apache/spark/pull/16912
[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/17963 cc @srowen @ajbozarth
[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/17963 @jerryshao The event log file will not be processed twice; see `FsHistoryProvider.checkForLogs` and `FsHistoryProvider.mergeApplicationListing`. In the next checking period, it checks the event log length against the corresponding app info from `fileToAppInfo`.
[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/17963 @jerryshao Thanks, I agree with that. This PR may be a temporary fix before SPARK-18085.
[GitHub] spark issue #17963: [SPARK-20722][CORE] Replay newer event log that hasn't b...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/17963 @ajbozarth Yes, this case is a big issue in my production cluster, which runs nearly 20,000 applications every day.
[GitHub] spark pull request #17963: [SPARK-20722][CORE] Replay newer event log that h...
Github user sharkdtu closed the pull request at: https://github.com/apache/spark/pull/17963
[GitHub] spark pull request #18352: [SPARK-21138] [YARN] Cannot delete staging dir wh...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/18352 [SPARK-21138] [YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different ## What changes were proposed in this pull request? When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows:

```
spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
spark.yarn.stagingDir      hdfs://ss-teg-2-v2/tmp/spark
```

the staging dir cannot be deleted, and the following message is printed:

```
java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
```

## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18352.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18352 commit b74138e31d3317b34ffb9f13cf7fdd7873edc1a6 Author: sharkdtu Date: 2017-06-19T11:03:01Z Cannot delete staging dir when the clusters of spark.yarn.stagingDir and spark.hadoop.fs.defaultFS are different
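The "Wrong FS" error arises because the staging path is handed to the default FileSystem, whose URI authority differs from the staging dir's; the natural fix is to obtain the FileSystem from the staging path itself. A hypothetical sketch of the authority mismatch (helper name is illustrative, not Hadoop's API):

```scala
import java.net.URI

// Compare URI authorities, conceptually as Hadoop's FileSystem.checkPath
// does: a path on ss-teg-2-v2 cannot be served by a default FS rooted at
// tl-nn-tdw.tencent-distribute.com:54310, hence the IllegalArgumentException.
def servedByDefaultFs(path: String, defaultFs: String): Boolean =
  Option(new URI(path).getAuthority) == Option(new URI(defaultFs).getAuthority)
```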
[GitHub] spark pull request #21658: [SPARK-24678][Spark-Streaming] Give priority in u...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/21658 [SPARK-24678][Spark-Streaming] Give priority in use of 'PROCESS_LOCAL' for spark-streaming ## What changes were proposed in this pull request? Currently, `BlockRDD.getPreferredLocations` only returns the host info of blocks, so the subsequent scheduling level can be no better than 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 'PROCESS_LOCAL'. ## How was this patch tested? Manual test You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21658.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21658 commit 666fb4c5d343a1ea439ecc284d047810d6189c23 Author: sharkdtu Date: 2018-06-28T07:35:52Z give priority in use of 'PROCESS_LOCAL' for spark-streaming
[GitHub] spark pull request #21658: [SPARK-24678][Spark-Streaming] Give priority in u...
Github user sharkdtu commented on a diff in the pull request: https://github.com/apache/spark/pull/21658#discussion_r200310184 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1569,7 +1569,7 @@ private[spark] object BlockManager { val blockManagers = new HashMap[BlockId, Seq[String]] for (i <- 0 until blockIds.length) { - blockManagers(blockIds(i)) = blockLocations(i).map(_.host) + blockManagers(blockIds(i)) = blockLocations(i).map(b => s"executor_${b.host}_${b.executorId}") --- End diff -- blockIdsToLocations ?
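The string built in the diff follows Spark's `TaskLocation` convention: an `executor_<host>_<executorId>` string is parsed as an executor-specific location (`ExecutorCacheTaskLocation`), which is what makes PROCESS_LOCAL scheduling possible. A small sketch of that encoding (hypothetical helper, not Spark's code):

```scala
// Encode a block location so the scheduler can match it to a specific
// executor, not just a host; Spark parses the "executor_" prefix as an
// ExecutorCacheTaskLocation, enabling PROCESS_LOCAL task placement.
def executorLocation(host: String, executorId: String): String =
  s"executor_${host}_${executorId}"
```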
[GitHub] spark issue #21658: [SPARK-24678][Spark-Streaming] Give priority in use of '...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/21658 @jerryshao Yeah, I have verified it in our cluster, and the locality is 'PROCESS_LOCAL'.
[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...
Github user sharkdtu closed the pull request at: https://github.com/apache/spark/pull/16651
[GitHub] spark pull request #16911: [SPARK-19576] [Core] Task attempt paths exist in ...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/16911 [SPARK-19576] [Core] Task attempt paths exist in output path after saveAsNewAPIHadoopFile completes with speculation enabled `writeShard` in `saveAsNewAPIHadoopDataset` always committed its tasks without question. The problem is that when speculation is enabled, this can sometimes result in multiple tasks committing their output to the same path, which may leave task temporary paths in the output path after `saveAsNewAPIHadoopFile` completes:

```
-rw-r--r--  3 user group    0 2017-02-11 19:36 hdfs://.../output/_SUCCESS
drwxr-xr-x  - user group    0 2017-02-11 19:36 hdfs://.../output/attempt_201702111936_32487_r_44_0
-rw-r--r--  3 user group 8952 2017-02-11 19:36 hdfs://.../output/part-r-0
-rw-r--r--  3 user group 7878 2017-02-11 19:36 hdfs://.../output/part-r-1
```

Assume two task attempts commit at the same time: both may rename their task attempt paths to the task committed path concurrently. When one attempt's `rename` completes, the other attempt's `rename` will place its task attempt path under the task committed path. In any case, it is not recommended that `writeShard` in `saveAsNewAPIHadoopDataset` always commit its tasks without question. A similar issue in SPARK-4879, triggered by calling `saveAsHadoopFile`, has been solved, and the newest master has solved this one too.
This PR just fixes branch-2.1. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark branch-2.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16911.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16911 commit a7f8ebb8629706c54c286b7aca658838e718e804 Author: Cheng Lian Date: 2016-12-02T06:02:45Z [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686 This PR targets to both master and branch-2.1. ## What changes were proposed in this pull request? Due to PARQUET-686, Parquet doesn't do string comparison correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around this issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as a plain Parquet `binary` instead of a `binary (UTF8)`. ## How was this patch tested? New test case added in `ParquetFilterSuite`. Author: Cheng Lian Closes #16106 from liancheng/spark-17213-bad-string-ppd. (cherry picked from commit ca6391637212814b7c0bd14c434a6737da17b258) Signed-off-by: Reynold Xin commit 65e896a6e9a5378f2d3a02c0c2a57fdb8d8f1d9d Author: Eric Liang Date: 2016-12-02T12:59:39Z [SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables ## What changes were proposed in this pull request? In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.
This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors). cc mallman cloud-fan ## How was this patch tested? Checked metrics in unit tests. Author: Eric Liang Closes #16112 from ericl/spark-18679. (cherry picked from commit 294163ee9319e4f7f6da1259839eb3c80bba25c2) Signed-off-by: Wenchen Fan commit 415730e19cea3a0e7ea5491bf801a22859bbab66 Author: Dongjoon Hyun Date: 2016-12-02T13:48:22Z [SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options ## What changes were proposed in this pull request? Currently, `JDBCRelation.insert` removes Spark options too early by mistakenly using `asConnectionProperties`. Spark options like `numPartitions` should be passed into `DataFrameWriter.jdbc` correctly. This bug have been **hidden** because `JDBCOptions.asConnectionProperties` fails to filter out the mixed-case options. This PR aims to fix both. **JDBCRelation.insert** ```scala override def insert(data: DataFrame, overwrite: Boolean): Unit = { val url = jdbcOptions.url val table = jdbcOptions.table - val properties = jdbcOptions.asConnectionProperties +
[GitHub] spark pull request #16911: [SPARK-19576] [Core] Task attempt paths exist in ...
Github user sharkdtu closed the pull request at: https://github.com/apache/spark/pull/16911
[GitHub] spark pull request #16912: [SPARK-19576] [Core] Task attempt paths exist in ...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/16912 [SPARK-19576] [Core] Task attempt paths exist in output path after saveAsNewAPIHadoopFile completes with speculation enabled `writeShard` in `saveAsNewAPIHadoopDataset` always committed its tasks without question. The problem is that when speculation is enabled, this can sometimes result in multiple tasks committing their output to the same path, which may leave task temporary paths in the output path after `saveAsNewAPIHadoopFile` completes:

```
-rw-r--r--  3 user group    0 2017-02-11 19:36 hdfs://.../output/_SUCCESS
drwxr-xr-x  - user group    0 2017-02-11 19:36 hdfs://.../output/attempt_201702111936_32487_r_44_0
-rw-r--r--  3 user group 8952 2017-02-11 19:36 hdfs://.../output/part-r-0
-rw-r--r--  3 user group 7878 2017-02-11 19:36 hdfs://.../output/part-r-1
```

Assume two task attempts commit at the same time: both may rename their task attempt paths to the task committed path concurrently. When one attempt's `rename` completes, the other attempt's `rename` will place its task attempt path under the task committed path. In any case, it is not recommended that `writeShard` in `saveAsNewAPIHadoopDataset` always commit its tasks without question. A similar issue in SPARK-4879, triggered by calling `saveAsHadoopFile`, has been solved, and the newest master has solved this one too.
This PR just fixes branch-2.1. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark branch-2.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16912.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16912 commit 6f41b90583c585414b99fe716377d0576499de8d Author: sharkdtu Date: 2017-02-13T11:46:48Z Task attempt paths exist in output path after saveAsNewAPIHadoopFile completes with speculation enabled
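The rename race described above is exactly what commit coordination prevents: with speculation, two attempts of the same partition may both try to rename their attempt paths, and only one should be allowed to. A simplified first-wins sketch of that idea (illustrative only, not Spark's actual `OutputCommitCoordinator` API):

```scala
// Only the first attempt to ask for a given partition may commit;
// the losing speculative attempt must abort instead of renaming its
// attempt path into the committed output directory.
class FirstWinsCoordinator {
  private val committed = scala.collection.mutable.Set.empty[Int]
  def canCommit(partition: Int): Boolean = synchronized {
    committed.add(partition) // add() returns false if already committed
  }
}
```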
[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/16651 [SPARK-19298][Core] History server can't match MalformedInputException and prompt the detail logs while replaying event log The history server can't match `MalformedInputException` and log its details while replaying an event log, because `MalformedInputException` is a subclass of `IOException`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16651.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16651 commit 07f59016d6175d5aac0242f7432ce09bb3f984b0 Author: sharkdtu Date: 2017-01-20T02:06:55Z fix MalformedInputException match
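The subtlety is pattern-match ordering: since `MalformedInputException` extends `IOException`, a `case ioe: IOException` clause listed first intercepts it, so it is rethrown rather than logged. A small sketch of the ordering point (hypothetical classifier, not the actual `ReplayListenerBus` code):

```scala
import java.io.IOException
import java.nio.charset.MalformedInputException

// The subclass case must precede the superclass case,
// otherwise it is unreachable and the exception is rethrown.
def classify(t: Throwable): String = t match {
  case _: MalformedInputException => "log and continue" // subclass first
  case _: IOException             => "rethrow"
  case _                          => "log parse error"
}
```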
[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...
Github user sharkdtu commented on a diff in the pull request: https://github.com/apache/spark/pull/16651#discussion_r97034201 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala --- @@ -107,11 +107,11 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging { } } } catch { - case ioe: IOException => - throw ioe - case e: Exception => - logError(s"Exception parsing Spark event log: $sourceName", e) + case ex: MalformedInputException => --- End diff -- Thanks, I forgot to import MalformedInputException.
[GitHub] spark pull request #16651: [SPARK-19298][Core] History server can't match Ma...
Github user sharkdtu commented on a diff in the pull request: https://github.com/apache/spark/pull/16651#discussion_r97037524 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala --- @@ -107,11 +107,11 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging { } } } catch { - case ioe: IOException => - throw ioe - case e: Exception => - logError(s"Exception parsing Spark event log: $sourceName", e) + case ex: MalformedInputException => --- End diff -- Please check: https://issues.apache.org/jira/browse/SPARK-19298
[GitHub] spark issue #16651: [SPARK-19298][Core] History server can't match Malformed...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/16651 @srowen I think the logs were just for `MalformedInputException`; it doesn't matter that non-IOExceptions will be rethrown, because they will be caught by upper callers.
[GitHub] spark pull request: [Core] Remove unnecessary calculation of stage...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/13123 [Core] Remove unnecessary calculation of stage's parents ## What changes were proposed in this pull request? Remove the unnecessary calculation of a stage's parents, because a stage's parents are already set at stage construction time. ## How was this patch tested? Make use of the existing test cases You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13123 commit 6e93108e5a6642938bc1d16c3f204714f05e4bd5 Author: sharkd Date: 2016-05-15T09:15:02Z Remove unnecessary calculation of stage's parents
[GitHub] spark pull request #14088: Fix bugs for "Can not get user config when callin...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/14088

Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf in other places"

## What changes were proposed in this pull request?

The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster`, so the `sparkConf` and `conf` in the `SparkHadoopUtil` singleton didn't include the user's configuration. But other places, such as `DataSourceStrategy`, use the Hadoop configuration from `SparkHadoopUtil`:

```scala
...
case PhysicalOperation(projects, filters, l @ LogicalRelation(t: HadoopFsRelation, _)) =>
  // See buildPartitionedTableScan for the reason that we need to create a shard
  // broadcast HadoopConf.
  val sharedHadoopConf = SparkHadoopUtil.get.conf
  val confBroadcast = t.sqlContext.sparkContext.broadcast(
    new SerializableConfiguration(sharedHadoopConf))
...
```

## How was this patch tested?

Existing test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14088.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14088

commit 55e66b21cdcd68861db0f1045186048c54b13153
Author: sharkdtu
Date: 2016-07-07T11:04:11Z

    Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf in other places, such as DataSourceStrategy"
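The initialization-order bug described in this PR can be illustrated with a minimal, hypothetical singleton (none of these names are from Spark): a `lazy val` snapshots system properties the first time it is touched, so properties set afterwards are invisible to it, just as the real `SparkHadoopUtil` singleton missed configuration loaded later by `ApplicationMaster`.

```scala
// Hypothetical illustration of the ordering bug: this singleton
// snapshots system properties at first access, mirroring how the
// SparkHadoopUtil singleton captured its config at instantiation.
object ConfSnapshot {
  lazy val conf: Map[String, String] =
    sys.props.toMap.filter { case (k, _) => k.startsWith("spark.") }
}

sys.props("spark.hadoop.early") = "seen"
ConfSnapshot.conf                          // snapshot is taken here
sys.props("spark.hadoop.late") = "missed"  // too late: not in the snapshot
```

Anything set before the first access is captured; anything set after is lost, which is why the fix moves the property loading ahead of the singleton's first use.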
[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...
Github user sharkdtu commented on the issue: https://github.com/apache/spark/pull/14088 @tgravescs Fixed the description and style.
[GitHub] spark pull request #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get us...
Github user sharkdtu commented on a diff in the pull request: https://github.com/apache/spark/pull/14088#discussion_r70076189

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -743,6 +735,14 @@ object ApplicationMaster extends Logging {
   def main(args: Array[String]): Unit = {
     SignalUtils.registerLogger(log)
     val amArgs = new ApplicationMasterArguments(args)
+
+    // Load the properties file with the Spark configuration and set entries as system properties,
+    // so that user code run inside the AM also has access to them.
--- End diff --

@tgravescs Thanks, done.
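The comment added in the diff can be sketched as a standalone helper. This is an assumed reimplementation for illustration, not Spark's actual method: it reads a Java properties file and publishes each entry as a system property, so code running later in the same JVM sees the user's configuration.

```scala
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Assumed sketch of "load the properties file ... and set entries as
// system properties": every key/value pair in the file becomes visible
// via sys.props to user code that runs afterwards in the same JVM.
def loadPropertiesAsSystemProps(path: String): Unit = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()
  props.asScala.foreach { case (k, v) => sys.props(k) = v }
}
```

Running this before any config-capturing singleton is touched is the essence of the fix: the properties are in place by the time anything snapshots them.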
[GitHub] spark pull request #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get us...
Github user sharkdtu commented on a diff in the pull request: https://github.com/apache/spark/pull/14088#discussion_r70362297

--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala ---
@@ -274,6 +288,37 @@ private object YarnClusterDriverWithFailure extends Logging with Matchers {
   }
 }

+private object YarnClusterDriverUseSparkHadoopUtilConf extends Logging with Matchers {
+  def main(args: Array[String]): Unit = {
+    if (args.length != 2) {
+      // scalastyle:off println
+      System.err.println(
+        s"""
+        |Invalid command line: ${args.mkString(" ")}
+        |
+        |Usage: YarnClusterDriverUseSparkHadoopUtilConf [propertyKey=value] [result file]
+        """.stripMargin)
+      // scalastyle:on println
+      System.exit(1)
+    }
+
+    val sc = new SparkContext(new SparkConf()
+      .set("spark.extraListeners", classOf[SaveExecutorInfo].getName)
+      .setAppName("yarn test using SparkHadoopUtil's conf"))
+
+    val propertyKeyValue = args(0).split("=")
+    val status = new File(args(1))
+    var result = "failure"
+    try {
+      SparkHadoopUtil.get.conf.get(propertyKeyValue(0).drop(13)) should be (propertyKeyValue(1))
--- End diff --

It means dropping the `spark.hadoop.` prefix. It may be hard to understand, so I will fix it.
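Since `"spark.hadoop."` is exactly 13 characters, `drop(13)` and `stripPrefix("spark.hadoop.")` produce the same result on such keys; a quick sketch (the key name is illustrative) of why the latter reads better:

```scala
val key = "spark.hadoop.fs.defaultFS"  // illustrative key, not from the suite

// "spark.hadoop." has length 13, so both expressions strip the prefix,
// but stripPrefix states the intent instead of using a magic number.
val byDrop   = key.drop(13)
val byPrefix = key.stripPrefix("spark.hadoop.")
```

`stripPrefix` also degrades more gracefully: on a key without the prefix it returns the key unchanged, whereas `drop(13)` silently truncates it.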
[GitHub] spark pull request #14166: [MINOR][YARN] Fix code error in yarn-cluster unit...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/14166

[MINOR][YARN] Fix code error in yarn-cluster unit test

## What changes were proposed in this pull request?

Fix a code error in the yarn-cluster unit test.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sharkdtu/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14166.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14166

commit afb56a27b61a81d17a16405c95872eddff7e0bd1
Author: sharkd
Date: 2016-07-12T23:59:26Z

    rebase apache/master

commit 995d606243a95965cb0be28cf7006883400e09ac
Author: sharkd
Date: 2016-07-11T16:49:56Z

    fix style

commit 816979bc5e834aebd23e485bc6251640573fb0a4
Author: sharkd
Date: 2016-07-12T23:14:02Z

    fix code error in yarn-cluster unit test