[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288298#comment-16288298 ] Eric Vandenberg commented on SPARK-21867: - 1. The default would be 1, so it does not change the default behavior. It is currently configurable to set this higher (spark.shuffle.async.num.sorter=2). 2. Yes, the number of spill files could increase; that's one reason this is not on by default. This could be an issue if it hits file system limits, etc. in extreme cases. For the jobs we've tested, this wasn't a problem. We think this improvement has the biggest impact on larger jobs (we've seen CPU reduction of ~30% in some large jobs); it may not help as much for smaller jobs with fewer spills. 3. When a sorter hits the threshold, it will kick off an asynchronous spill and then continue inserting into another sorter (assuming one is available). It could make sense to raise the threshold, which would result in larger spill files. There is some risk that raising it might push it too high, causing an OOM and then needing to lower it again. I'm thinking the algorithm could be improved by more accurately calculating and enforcing the threshold based on available memory over time; however, doing this would require exposing some memory allocation metrics not currently available (in the memory manager), so I opted not to do that for now. 4. Yes, too many open files/buffers could be an issue. So for now this is something to look at enabling case by case as part of performance tuning. > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > Attachments: Async ShuffleExternalSorter.pdf > > > Currently, Spark tasks are single-threaded. But we see it could greatly > improve the performance of the jobs if we could multi-thread some parts of them. > For example, profiling our map tasks, which read large amounts of data from > HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling > the majority of the time. Since both these operations are IO intensive, the > average CPU consumption during the map phase is significantly low. In theory, > both HDFS reads and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in the case > of a map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we cannot read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can keep reading from HDFS while the sort and spill is happening > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - an active region and a spilling region - each of size 500 MB. We > start with reading from HDFS and filling the active region. Once we hit the > limit of the active region, we issue an asynchronous spill while flipping the > active region and spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read the data > from HDFS. 
This way we can amortize the high disk/network io cost during > spilling. > We made a prototype hack to implement this feature and we could see our map > tasks were as much as 40% faster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
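To make the double-buffering idea above concrete, here is a minimal standalone Scala sketch of the scheme described in the issue: two fixed-size regions, with inserts continuing into one while the other is sorted and spilled on a background thread. All names here (Region, spillAsync, the capacity of 4 records) are hypothetical; this is not the actual ShuffleExternalSorter code.
{code}
import java.util.concurrent.Executors
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical sketch of async spilling with two fixed-size memory regions.
object AsyncSpillSketch {
  private val spillPool =
    ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

  final class Region(val capacity: Int) {
    val records = new ArrayBuffer[Long](capacity)
    def isFull: Boolean = records.size >= capacity
  }

  // Sort and "spill" in the background; a real spill would write the sorted
  // records to a shuffle file on disk instead of printing them.
  private def spillAsync(r: Region): Future[Unit] = Future {
    val sorted = r.records.sorted
    println(s"spilled ${sorted.size} records: ${sorted.mkString(",")}")
  }(spillPool)

  def main(args: Array[String]): Unit = {
    var active = new Region(capacity = 4)
    var inFlight: Option[Future[Unit]] = None

    for (record <- 1L to 10L) {
      if (active.isFull) {
        // Only one region spills at a time: wait for the previous spill,
        // then flip roles so inserts continue into a fresh region.
        inFlight.foreach(Await.result(_, Duration.Inf))
        inFlight = Some(spillAsync(active))
        active = new Region(capacity = 4)
      }
      active.records += record // inserts keep going while the other region spills
    }

    inFlight.foreach(Await.result(_, Duration.Inf))
    Await.result(spillAsync(active), Duration.Inf) // final spill of the remainder
    spillPool.shutdown()
  }
}
{code}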
[jira] [Updated] (SPARK-21867) Support async spilling in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21867: Attachment: Async ShuffleExternalSorter.pdf Here is a design proposal to implement this performance improvement. Will submit a PR shortly. > Support async spilling in UnsafeShuffleWriter > - > > Key: SPARK-21867 > URL: https://issues.apache.org/jira/browse/SPARK-21867 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Priority: Minor > Attachments: Async ShuffleExternalSorter.pdf > > > Currently, Spark tasks are single-threaded. But we see it could greatly > improve the performance of the jobs if we could multi-thread some parts of them. > For example, profiling our map tasks, which read large amounts of data from > HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling > the majority of the time. Since both these operations are IO intensive, the > average CPU consumption during the map phase is significantly low. In theory, > both HDFS reads and spilling can be done in parallel if we had additional > memory to store data read from HDFS while we are spilling the last batch read. > Let's say we have 1G of shuffle memory available per task. Currently, in the case > of a map task, it reads from HDFS and the records are stored in the available > memory buffer. Once we hit the memory limit and there is no more space to > store the records, we sort and spill the content to disk. While we are > spilling to disk, since we do not have any available memory, we cannot read > from HDFS concurrently. > Here we propose supporting async spilling for UnsafeShuffleWriter, so that we > can keep reading from HDFS while the sort and spill is happening > asynchronously. Let's say the total 1G of shuffle memory can be split into > two regions - an active region and a spilling region - each of size 500 MB. We > start with reading from HDFS and filling the active region. Once we hit the > limit of the active region, we issue an asynchronous spill while flipping the > active region and spilling region. While the spill is happening > asynchronously, we still have 500 MB of memory available to read the data > from HDFS. This way we can amortize the high disk/network io cost during > spilling. > We made a prototype hack to implement this feature and we could see our map > tasks were as much as 40% faster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
[ https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16183117#comment-16183117 ] Eric Vandenberg commented on SPARK-22077: - We are using a Facebook internal cluster scheduler. I don't think the issue is the cluster manager, the actual java.net.URI call returns null for host, name etc (see description above) so it is immediately throwing at this point. Are you saying the example above *does* parse correctly for you? If so, then I wonder if the discrepancy is due to different JDK versions. According to java.net.URI the URL supports IPv6 (https://docs.oracle.com/javase/7/docs/api/java/net/URI.html) as defined in the RFC 2732 (http://www.ietf.org/rfc/rfc2732.txt) which includes IPv6 address enclosed in square braces [], for example: http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.html Thanks, Eric > RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. > - > > Key: SPARK-22077 > URL: https://issues.apache.org/jira/browse/SPARK-22077 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. > For example, > sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243" > is parsed as: > host = null > port = -1 > name = null > While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly. > This is happening on our production machines and causing spark to not start > up. > org.apache.spark.SparkException: Invalid Spark URL: > spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243 > at > org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65) > at > org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96) > at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32) > at org.apache.spark.executor.Executor.(Executor.scala:121) > at > org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173) > at org.apache.spark.SparkContext.(SparkContext.scala:507) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
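For reference, the behavior described above is easy to reproduce directly against java.net.URI (standalone Scala; the first URL is the example from the description, the bracketed variant is a hypothetical rewrite of it): the raw IPv6 literal yields a null host and port -1, while the RFC 2732 bracketed form parses into host and port.
{code}
object Ipv6UriCheck {
  def show(url: String): Unit = {
    val uri = new java.net.URI(url)
    println(s"$url -> host=${uri.getHost} port=${uri.getPort} userInfo=${uri.getUserInfo}")
  }

  def main(args: Array[String]): Unit = {
    // Unbracketed IPv6 literal: the colons are ambiguous with the port separator,
    // so java.net.URI falls back to a registry-based authority and
    // getHost()/getPort() return null / -1.
    show("spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243")
    // RFC 2732 form with the address in square brackets parses as a server-based
    // authority, so the host and port are extracted.
    show("spark://HeartbeatReceiver@[2401:db00:2111:40a1:face:0:21:0]:35243")
  }
}
{code}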
[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
[ https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175479#comment-16175479 ] Eric Vandenberg commented on SPARK-22077: - Yes, it worked when I overloaded with "localhost" so it is just a parsing issue. However this is the default hostname in our configuration so the parsing should be more flexible for ipv6 address. > RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. > - > > Key: SPARK-22077 > URL: https://issues.apache.org/jira/browse/SPARK-22077 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. > For example, > sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243" > is parsed as: > host = null > port = -1 > name = null > While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly. > This is happening on our production machines and causing spark to not start > up. > org.apache.spark.SparkException: Invalid Spark URL: > spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243 > at > org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65) > at > org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96) > at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32) > at org.apache.spark.executor.Executor.(Executor.scala:121) > at > org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173) > at org.apache.spark.SparkContext.(SparkContext.scala:507) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
Eric Vandenberg created SPARK-22077: --- Summary: RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. Key: SPARK-22077 URL: https://issues.apache.org/jira/browse/SPARK-22077 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.0.0 Reporter: Eric Vandenberg Priority: Minor RpcEndpointAddress fails to parse spark URL if it is an ipv6 address. For example, sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243" is parsed as: host = null port = -1 name = null While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly. This is happening on our production machines and causing spark to not start up. org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243 at org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65) at org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133) at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88) at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96) at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32) at org.apache.spark.executor.Executor.(Executor.scala:121) at org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59) at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173) at org.apache.spark.SparkContext.(SparkContext.scala:507) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283) at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833) at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21598) Collect usability/events information from Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113040#comment-16113040 ] Eric Vandenberg commented on SPARK-21598: - [~steve_l] Do you have any input / thoughts here? The goal here is to collect more information than is available in typical metrics. I would like to directly correlate the replay times with other replay activity attributes like job size, user impact (ie, was the user waiting for a response in real time?), etc. This is usability more than operational; this information would make it easier to target and measure specific improvements to the Spark history server user experience. We often have internal users who complain about history server performance, and we need a way to directly reference / understand their experience, since the Spark history server is critical for our internal debugging. If there's a way to capture this information using metrics alone I would like to learn more, but from my understanding they aren't designed to capture this level of information. > Collect usability/events information from Spark History Server > -- > > Key: SPARK-21598 > URL: https://issues.apache.org/jira/browse/SPARK-21598 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.0.2 >Reporter: Eric Vandenberg >Priority: Minor > > The Spark History Server doesn't currently have a way to collect > usability/performance on its main activity, loading/replay of history files. > We'd like to collect this information to monitor, target and measure > improvements in the spark debugging experience (via history server usage.) > Once available these usability events could be analyzed using other analytics > tools. > The event info to collect: > SparkHistoryReplayEvent( > logPath: String, > logCompressionType: String, > logReplayException: String // if an error > logReplayAction: String // user replay, vs checkForLogs replay > logCompleteFlag: Boolean, > logFileSize: Long, > logFileSizeUncompressed: Long, > logLastModifiedTimestamp: Long, > logCreationTimestamp: Long, > logJobId: Long, > logNumEvents: Int, > logNumStages: Int, > logNumTasks: Int > logReplayDurationMillis: Long > ) > The main spark engine has a SparkListenerInterface through which all compute > engine events are broadcast. It probably doesn't make sense to reuse this > abstraction for broadcasting spark history server events since the "events" > are not related or compatible with one another. Also note the metrics > registry collects history caching metrics but doesn't provide the type of > above information. > Proposal here would be to add some basic event listener infrastructure to > capture history server activity events. This would work similar to how the > SparkListener infrastructure works. It could be configured in a similar > manner, eg. spark.history.listeners=MyHistoryListenerClass. > Open to feedback / suggestions / comments on the approach or alternatives. > cc: [~vanzin] [~cloud_fan] [~ajbozarth] [~jiangxb1987] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21598) Collect usability/events information from Spark History Server
Eric Vandenberg created SPARK-21598: --- Summary: Collect usability/events information from Spark History Server Key: SPARK-21598 URL: https://issues.apache.org/jira/browse/SPARK-21598 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.0.2 Reporter: Eric Vandenberg Priority: Minor The Spark History Server doesn't currently have a way to collect usability/performance on its main activity, loading/replay of history files. We'd like to collect this information to monitor, target and measure improvements in the spark debugging experience (via history server usage.) Once available these usability events could be analyzed using other analytics tools. The event info to collect: SparkHistoryReplayEvent( logPath: String, logCompressionType: String, logReplayException: String // if an error logReplayAction: String // user replay, vs checkForLogs replay logCompleteFlag: Boolean, logFileSize: Long, logFileSizeUncompressed: Long, logLastModifiedTimestamp: Long, logCreationTimestamp: Long, logJobId: Long, logNumEvents: Int, logNumStages: Int, logNumTasks: Int logReplayDurationMillis: Long ) The main spark engine has a SparkListenerInterface through which all compute engine events are broadcast. It probably doesn't make sense to reuse this abstraction for broadcasting spark history server events since the "events" are not related or compatible with one another. Also note the metrics registry collects history caching metrics but doesn't provide the type of above information. Proposal here would be to add some basic event listener infrastructure to capture history server activity events. This would work similar to how the SparkListener infrastructure works. It could be configured in a similar manner, eg. spark.history.listeners=MyHistoryListenerClass. Open to feedback / suggestions / comments on the approach or alternatives. cc: [~vanzin] [~cloud_fan] [~ajbozarth] [~jiangxb1987] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
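As a rough illustration of the proposal above, the event and listener hook might look something like the following (illustrative Scala only; none of these types exist in Spark, and the field list simply mirrors the one in the description). A class implementing the listener would be loaded by name from a setting such as spark.history.listeners, analogous to how spark.extraListeners works for SparkListener.
{code}
// Illustrative only: a possible shape for the proposed history server events.
case class SparkHistoryReplayEvent(
    logPath: String,
    logCompressionType: String,
    logReplayException: String,  // set only if replay failed
    logReplayAction: String,     // e.g. user replay vs. checkForLogs replay
    logCompleteFlag: Boolean,
    logFileSize: Long,
    logFileSizeUncompressed: Long,
    logLastModifiedTimestamp: Long,
    logCreationTimestamp: Long,
    logJobId: Long,
    logNumEvents: Int,
    logNumStages: Int,
    logNumTasks: Int,
    logReplayDurationMillis: Long)

// Listener interface analogous to SparkListener, registered via a setting
// like spark.history.listeners=MyHistoryListenerClass.
trait SparkHistoryListener {
  def onReplay(event: SparkHistoryReplayEvent): Unit
}

// Example listener that just logs each replay.
class MyHistoryListenerClass extends SparkHistoryListener {
  override def onReplay(event: SparkHistoryReplayEvent): Unit = {
    println(s"replayed ${event.logPath} in ${event.logReplayDurationMillis} ms " +
      s"(${event.logNumEvents} events, complete=${event.logCompleteFlag})")
  }
}
{code}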
[jira] [Commented] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.
[ https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109282#comment-16109282 ] Eric Vandenberg commented on SPARK-21571: - Link to pull request https://github.com/apache/spark/pull/18791 > Spark history server leaves incomplete or unreadable history files around > forever. > -- > > Key: SPARK-21571 > URL: https://issues.apache.org/jira/browse/SPARK-21571 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Eric Vandenberg >Priority: Minor > > We have noticed that history server logs are sometimes never cleaned up. The > current history server logic *ONLY* cleans up history files if they are > completed since in general it doesn't make sense to clean up inprogress > history files (after all, the job is presumably still running?) Note that > inprogress history files would generally not be targeted for clean up any way > assuming they regularly flush logs and the file system accurately updates the > history log last modified time/size, while this is likely it is not > guaranteed behavior. > As a consequence of the current clean up logic and a combination of unclean > shutdowns, various file system bugs, earlier spark bugs, etc. we have > accumulated thousands of these dead history files associated with long since > gone jobs. > For example (with spark.history.fs.cleaner.maxAge=14d): > -rw-rw 3 xx ooo > 14382 2016-09-13 15:40 > /user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard > -rw-rw 3 ooo > 5933 2016-11-01 20:16 > /user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4 > -rw-rw 3 yyy ooo >0 2017-01-19 11:59 > /user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress > -rw-rw 3 xooo >0 2017-01-19 14:17 > /user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress > -rw-rw 3 yyy ooo >0 2017-01-20 10:56 > /user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress > -rw-rw 3 ooo > 11955 2017-01-20 17:55 > /user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress > -rw-rw 3 ooo > 11958 2017-01-20 17:55 > /user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress > -rw-rw 3 ooo > 11960 2017-01-20 17:55 > /user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress > Based on the current logic, clean up candidates are skipped in several cases: > 1. if a file has 0 bytes, it is completely ignored > 2. if a file is in progress and not paresable/can't extract appID, is it > completely ignored > 3. if a file is complete and but not parseable/can't extract appID, it is > completely ignored. > To address this edge case and provide a way to clean out orphaned history > files I propose a new configuration option: > spark.history.fs.cleaner.aggressive={true, false}, default is false. > If true, the history server will more aggressively garbage collect history > files in cases (1), (2) and (3). Since the default is false, existing > customers won't be affected unless they explicitly opt-in. If customers have > similar leaking garbage over time they have the option of aggressively > cleaning up in such cases. Also note that aggressive clean up may not be > appropriate for some customers if they have long running jobs that exceed the > cleaner.maxAge time frame and/or have buggy file systems. > Would like to get feedback on if this seems like a reasonable solution. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.
[ https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21571: Description: We have noticed that history server logs are sometimes never cleaned up. The current history server logic *ONLY* cleans up history files if they are completed since in general it doesn't make sense to clean up inprogress history files (after all, the job is presumably still running?) Note that inprogress history files would generally not be targeted for clean up any way assuming they regularly flush logs and the file system accurately updates the history log last modified time/size, while this is likely it is not guaranteed behavior. As a consequence of the current clean up logic and a combination of unclean shutdowns, various file system bugs, earlier spark bugs, etc. we have accumulated thousands of these dead history files associated with long since gone jobs. For example (with spark.history.fs.cleaner.maxAge=14d): -rw-rw 3 xx ooo 14382 2016-09-13 15:40 /user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard -rw-rw 3 ooo 5933 2016-11-01 20:16 /user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4 -rw-rw 3 yyy ooo 0 2017-01-19 11:59 /user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress -rw-rw 3 xooo 0 2017-01-19 14:17 /user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress -rw-rw 3 yyy ooo 0 2017-01-20 10:56 /user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress -rw-rw 3 ooo 11955 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress -rw-rw 3 ooo 11958 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress -rw-rw 3 ooo 11960 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress Based on the current logic, clean up candidates are skipped in several cases: 1. if a file has 0 bytes, it is completely ignored 2. if a file is in progress and not paresable/can't extract appID, is it completely ignored 3. if a file is complete and but not parseable/can't extract appID, it is completely ignored. To address this edge case and provide a way to clean out orphaned history files I propose a new configuration option: spark.history.fs.cleaner.aggressive={true, false}, default is false. If true, the history server will more aggressively garbage collect history files in cases (1), (2) and (3). Since the default is false, existing customers won't be affected unless they explicitly opt-in. If customers have similar leaking garbage over time they have the option of aggressively cleaning up in such cases. Also note that aggressive clean up may not be appropriate for some customers if they have long running jobs that exceed the cleaner.maxAge time frame and/or have buggy file systems. Would like to get feedback on if this seems like a reasonable solution. was: We have noticed that history server logs are sometimes never cleaned up. The current history server logic *ONLY* cleans up history files if they are completed since in general it doesn't make sense to clean up inprogress history files (after all, the job is presumably still running?) Note that inprogress history files would generally not be targeted for clean up any way assuming they regularly flush logs and the file system accurately updates the history log last modified time/size, while this is likely it is not guaranteed behavior. 
As a consequence of the current clean up logic and a combination of unclean shutdowns, various file system bugs, earlier spark bugs, etc. we have accumulated thousands of these dead history files associated with long since gone jobs. For example (with spark.history.fs.cleaner.maxAge=14d): -rw-rw 3 xx ooo 14382 2016-09-13 15:40 /user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard -rw-rw 3 ooo 5933 2016-11-01 20:16 /user/
[jira] [Created] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.
Eric Vandenberg created SPARK-21571: --- Summary: Spark history server leaves incomplete or unreadable history files around forever. Key: SPARK-21571 URL: https://issues.apache.org/jira/browse/SPARK-21571 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.2.0 Reporter: Eric Vandenberg Priority: Minor We have noticed that history server logs are sometimes never cleaned up. The current history server logic *ONLY* cleans up history files if they are completed, since in general it doesn't make sense to clean up inprogress history files (after all, the job is presumably still running?) Note that inprogress history files would generally not be targeted for clean up anyway, assuming they regularly flush logs and the file system accurately updates the history log's last modified time/size; while this is likely, it is not guaranteed behavior. As a consequence of the current clean up logic and a combination of unclean shutdowns, various file system bugs, earlier spark bugs, etc., we have accumulated thousands of these dead history files associated with long since gone jobs. For example (with spark.history.fs.cleaner.maxAge=14d): -rw-rw 3 xx ooo 14382 2016-09-13 15:40 /user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard -rw-rw 3 ooo 5933 2016-11-01 20:16 /user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4 -rw-rw 3 yyy ooo 0 2017-01-19 11:59 /user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress -rw-rw 3 xooo 0 2017-01-19 14:17 /user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress -rw-rw 3 yyy ooo 0 2017-01-20 10:56 /user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress -rw-rw 3 ooo 11955 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress -rw-rw 3 ooo 11958 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress -rw-rw 3 ooo 11960 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress Based on the current logic, clean up candidates are skipped in several cases: 1. if a file has 0 bytes, it is completely ignored 2. if a file is in progress, it is completely ignored 3. if a file is complete but not parseable, or the appID can't be extracted, it is completely ignored. To address this edge case and provide a way to clean out orphaned history files, I propose a new configuration option: spark.history.fs.cleaner.aggressive={true, false}, default is false. If true, the history server will more aggressively garbage collect history files in cases (1), (2) and (3). Since the default is false, existing customers won't be affected unless they explicitly opt in. If customers have similar garbage leaking over time, they have the option of aggressively cleaning up in such cases. Also note that aggressive clean up may not be appropriate for some customers if they have long running jobs that exceed the cleaner.maxAge time frame and/or have buggy file systems. Would like to get feedback on whether this seems like a reasonable solution. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
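For concreteness, here is a small standalone sketch of how the proposed flag could change the cleaner's decision (hypothetical LogInfo/shouldDelete names, not the actual FsHistoryProvider logic): the three skipped cases above become eligible for deletion, but only when spark.history.fs.cleaner.aggressive is enabled.
{code}
// Hypothetical model of a history log entry as the cleaner sees it.
case class LogInfo(
    sizeBytes: Long,
    lastModifiedMs: Long,
    inProgress: Boolean,
    parsedAppId: Option[String])  // None when the appID could not be extracted

object CleanerSketch {
  /** Whether the cleaner should delete this log, under the proposed flag. */
  def shouldDelete(log: LogInfo, nowMs: Long, maxAgeMs: Long, aggressive: Boolean): Boolean = {
    val expired = nowMs - log.lastModifiedMs > maxAgeMs
    if (!expired) {
      false
    } else if (log.sizeBytes == 0 || log.inProgress || log.parsedAppId.isEmpty) {
      // Cases (1), (2) and (3) above: skipped today, cleaned up only on opt-in.
      aggressive
    } else {
      true // expired, complete, parseable logs are already cleaned up today
    }
  }

  def main(args: Array[String]): Unit = {
    val day = 24L * 3600 * 1000
    val orphan = LogInfo(sizeBytes = 0, lastModifiedMs = 0, inProgress = true, parsedAppId = None)
    println(shouldDelete(orphan, nowMs = 20 * day, maxAgeMs = 14 * day, aggressive = false)) // false
    println(shouldDelete(orphan, nowMs = 20 * day, maxAgeMs = 14 * day, aggressive = true))  // true
  }
}
{code}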
[jira] [Commented] (SPARK-11170) EOFException on History server reading in progress lz4
[ https://issues.apache.org/jira/browse/SPARK-11170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100537#comment-16100537 ] Eric Vandenberg commented on SPARK-11170: - There's a fix for this, see https://issues.apache.org/jira/browse/SPARK-21447 it's been reviewed/tested and pending merge. > EOFException on History server reading in progress lz4 > > > Key: SPARK-11170 > URL: https://issues.apache.org/jira/browse/SPARK-11170 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.1 > Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950) > Spark: 1.5.x (c27e1904) >Reporter: Sebastian YEPES FERNANDEZ > > The Spark History server is not able to read/save the jobs history if Spark > is configured to use > "spark.io.compression.codec=org.apache.spark.io.LZ4CompressionCodec", it > continuously generated the following error: > {code} > ERROR 2015-10-16 16:21:39 org.apache.spark.deploy.history.FsHistoryProvider: > Exception encountered when attempting to load application log > hdfs://DATA/user/spark/his > tory/application_1444297190346_0073_1.lz4.inprogress > java.io.EOFException: Stream ended prematurely > at > net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:218) > at > net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:150) > at > net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:117) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > INFO 2015-10-16 16:21:39 
org.apache.spark.deploy.history.FsHistoryProvider: > Replaying log path: > hdfs://DATA/user/spark/history/application_1444297190346_0072_1.lz4.i > nprogress > {code} > As a workaround setting > "spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec" > makes the History server work correctly -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21447) Spark history server fails to render compressed inprogress history file in some cases.
Eric Vandenberg created SPARK-21447: --- Summary: Spark history server fails to render compressed inprogress history file in some cases. Key: SPARK-21447 URL: https://issues.apache.org/jira/browse/SPARK-21447 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Environment: Spark History Server Reporter: Eric Vandenberg Priority: Minor We've observed the Spark History Server sometimes fails to load event data from a compressed .inprogress spark history file. Note the existing logic in ReplayListenerBus is to read each line, if it can't json parse the last line and it's inprogress (maybeTruncated) then it is accepted as best effort. In the case of compressed files, the output stream will compress on the fly json serialized event data. The output is periodically flushed to disk when internal buffers are full. A consequence of that is a partially compressed frame may be flushed, and not being a complete frame, it can not be decompressed. If the spark history server attempts to read such an .inprogress compressed file it will throw an EOFException. This is really analogous to the case of failing to json parse the last line in the file (because the full line was not flushed), the difference is that since the file is compressed it is possible the compression frame was not flushed, and trying to decompress a partial frame fails in a different way the code doesn't currently handle. 17/07/13 17:24:59 ERROR FsHistoryProvider: Exception encountered when attempting to load application log hdfs:///user/hadoop/**/spark/logs/job_**-*-*.lz4.inprogress java.io.EOFException: Stream ended prematurely at org.apache.spark.io.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:230) at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:203) at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) at java.io.InputStreamReader.read(InputStreamReader.java:184) at java.io.BufferedReader.fill(BufferedReader.java:161) at java.io.BufferedReader.readLine(BufferedReader.java:324) at java.io.BufferedReader.readLine(BufferedReader.java:389) at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72) at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:66) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:601) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:409) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:310) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
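The handling being proposed is essentially to extend the existing best-effort treatment of a truncated last line to an EOFException thrown by the decompressor. A simplified sketch of that pattern (standalone Scala; not the exact ReplayListenerBus code, and handleLine stands in for the JSON parse and event post):
{code}
import java.io.{EOFException, InputStream}
import scala.io.Source

object ReplaySketch {
  // Replay a (possibly in-progress, possibly compressed) event log stream.
  def replay(in: InputStream, sourceName: String, maybeTruncated: Boolean)
            (handleLine: String => Unit): Unit = {
    try {
      val lines = Source.fromInputStream(in, "UTF-8").getLines()
      while (lines.hasNext) {
        handleLine(lines.next()) // in Spark this JSON-parses and posts the event
      }
    } catch {
      case e: EOFException if maybeTruncated =>
        // An in-progress compressed log can end in the middle of a compression
        // frame; treat that like a half-written last line instead of failing.
        println(s"Truncated event log $sourceName, ignoring the partial tail: ${e.getMessage}")
    }
  }
}
{code}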
[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21219: Description: When a task fails it is (1) added into the pending task list and then (2) corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the order should be (1) the black list state should be updated and then (2) the task assigned, ensuring that the black list policy is properly enforced. The attached logs demonstrate the race condition. See spark_executor.log.anon: 1. Task 55.2 fails on the executor 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575) java.lang.OutOfMemoryError: Java heap space 2. Immediately the same executor is assigned the retry task: 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) 3. The retry task of course fails since the executor is also shutting down due to the original task 55.2 OOM failure. See the spark_driver.log.anon: The driver processes the lost task 55.2: 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): java.lang.OutOfMemoryError: Java heap space The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it's obfuscated in these logs, the server info is same...) 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): ExecutorLostFailure (executor attempt_foobar.masked-server.com-___.masked-server.com-____0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. was: When a task fails it is added into the pending task list and corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the black list state should be updated and then the task assigned, ensuring that the black list policy is properly enforced. The attached logs demonstrate the race condition. See spark_executor.log.anon: 1. Task 55.2 fails on the executor 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575) java.lang.OutOfMemoryError: Java heap space 2. Immediately the same executor is assigned the retry task: 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) 3. The retry task of course fails since the executor is also shutting down due to the original task 55.2 OOM failure. 
See the spark_driver.log.anon: The driver processes the lost task 55.2: 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): java.lang.OutOfMemoryError: Java heap space The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it's obfuscated in these logs, the server info is same...) 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): ExecutorLostFailure (executor attempt_foobar.masked-server.com-___.masked-server.com-____0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > W
[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21219: Description: When a task fails it is added into the pending task list and corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the black list state should be updated and then the task assigned, ensuring that the black list policy is properly enforced. The attached logs demonstrate the race condition. See spark_executor.log.anon: 1. Task 55.2 fails on the executor 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575) java.lang.OutOfMemoryError: Java heap space 2. Immediately the same executor is assigned the retry task: 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) 3. The retry task of course fails since the executor is also shutting down due to the original task 55.2 OOM failure. See the spark_driver.log.anon: The driver processes the lost task 55.2: 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): java.lang.OutOfMemoryError: Java heap space The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it's obfuscated in these logs, the server info is same...) 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): ExecutorLostFailure (executor attempt_foobar.masked-server.com-___.masked-server.com-____0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. was: When a task fails it is added into the pending task list and corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the black list state should be updated and then the task assigned, ensuring that the black list policy is properly enforced. > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > When a task fails it is added into the pending task list and corresponding > black list policy is enforced (ie, specifying if it can/can't run on a > particular node/executor/etc.) Unfortunately the ordering is such that > retrying the task could assign the task to the same executor, which, > incidentally could be shutting down and immediately fail the retry. Instead > the black list state should be updated and then the task assigned, ensuring > that the black list policy is properly enforced. > The attached logs demonstrate the race condition. > See spark_executor.log.anon: > 1. 
Task 55.2 fails on the executor > 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID > 39575) > java.lang.OutOfMemoryError: Java heap space > 2. Immediately the same executor is assigned the retry task: > 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 > 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) > 3. The retry task of course fails since the executor is also shutting down > due to the original task 55.2 OOM failure. > See the spark_driver.log.anon: > The driver processes the lost task 55.2: > 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID > 39575, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > java.lang.OutOfMemoryError: Java heap space > The driver then receives the ExecutorLostFailure for the retry task 55.3 > (although it's obfuscated in these logs, the server info is same...) > 17/06/20 13:25:10 WARN TaskSetManager: Lost
[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21219: Attachment: spark_executor.log.anon spark_driver.log.anon > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > When a task fails it is added into the pending task list and corresponding > black list policy is enforced (ie, specifying if it can/can't run on a > particular node/executor/etc.) Unfortunately the ordering is such that > retrying the task could assign the task to the same executor, which, > incidentally could be shutting down and immediately fail the retry. Instead > the black list state should be updated and then the task assigned, ensuring > that the black list policy is properly enforced. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
Eric Vandenberg created SPARK-21219: --- Summary: Task retry occurs on same executor due to race condition with blacklisting Key: SPARK-21219 URL: https://issues.apache.org/jira/browse/SPARK-21219 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.1 Reporter: Eric Vandenberg Priority: Minor When a task fails it is added into the pending task list and corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the black list state should be updated and then the task assigned, ensuring that the black list policy is properly enforced. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
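A minimal sketch of the ordering being argued for (hypothetical, heavily simplified scheduler state, not the actual TaskSetManager code): record the executor in the task's blacklist before the task is re-queued, so an immediate resource offer from that same executor cannot pick up the retry.
{code}
import scala.collection.mutable

object BlacklistOrderingSketch {
  private val pendingTasks = mutable.Queue[Int]()
  private val blacklistedExecs = mutable.Map[Int, mutable.Set[String]]()

  def handleFailedTask(taskId: Int, execId: String): Unit = {
    // 1) Update the blacklist state first...
    blacklistedExecs.getOrElseUpdate(taskId, mutable.Set()) += execId
    // 2) ...and only then make the task pending again. In the reverse order
    // there is a window where the same failing executor is offered the retry,
    // which is the race described in this issue.
    pendingTasks += taskId
  }

  def canRunOn(taskId: Int, execId: String): Boolean =
    !blacklistedExecs.getOrElse(taskId, mutable.Set.empty[String]).contains(execId)

  def main(args: Array[String]): Unit = {
    handleFailedTask(taskId = 55, execId = "executor-1")
    println(canRunOn(55, "executor-1")) // false: the retry is not re-offered here
    println(canRunOn(55, "executor-2")) // true
  }
}
{code}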
[jira] [Commented] (SPARK-21155) Add (? running tasks) into Spark UI progress
[ https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061225#comment-16061225 ] Eric Vandenberg commented on SPARK-21155: - Added screen shot with skipped tasks for reference. > Add (? running tasks) into Spark UI progress > > > Key: SPARK-21155 > URL: https://issues.apache.org/jira/browse/SPARK-21155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot > 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png > > > The progress UI for Active Jobs / Tasks should show the number of exact > number of running tasks. See screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress
[ https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21155: Attachment: Screen Shot 2017-06-22 at 9.58.08 AM.png > Add (? running tasks) into Spark UI progress > > > Key: SPARK-21155 > URL: https://issues.apache.org/jira/browse/SPARK-21155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot > 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png > > > The progress UI for Active Jobs / Tasks should show the number of exact > number of running tasks. See screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21155) Add (? running tasks) into Spark UI progress
[ https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21155: Comment: was deleted (was: Before ) > Add (? running tasks) into Spark UI progress > > > Key: SPARK-21155 > URL: https://issues.apache.org/jira/browse/SPARK-21155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot > 2017-06-20 at 3.40.39 PM.png > > > The progress UI for Active Jobs / Tasks should show the number of exact > number of running tasks. See screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress
[ https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21155: Attachment: Screen Shot 2017-06-20 at 3.40.39 PM.png Before > Add (? running tasks) into Spark UI progress > > > Key: SPARK-21155 > URL: https://issues.apache.org/jira/browse/SPARK-21155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot > 2017-06-20 at 3.40.39 PM.png > > > The progress UI for Active Jobs / Tasks should show the number of exact > number of running tasks. See screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21155) Add (? running tasks) into Spark UI progress
[ https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056500#comment-16056500 ] Eric Vandenberg commented on SPARK-21155: - The PR is at https://github.com/apache/spark/pull/18369/files The current view does not contain the "(? running tasks)" text in the progress UI under Tasks. > Add (? running tasks) into Spark UI progress > > > Key: SPARK-21155 > URL: https://issues.apache.org/jira/browse/SPARK-21155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png > > > The progress UI for Active Jobs / Tasks should show the exact > number of running tasks. See the screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress
[ https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21155: Attachment: Screen Shot 2017-06-20 at 12.32.58 PM.png > Add (? running tasks) into Spark UI progress > > > Key: SPARK-21155 > URL: https://issues.apache.org/jira/browse/SPARK-21155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Fix For: 2.3.0 > > Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png > > > The progress UI for Active Jobs / Tasks should show the number of exact > number of running tasks. See screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21155) Add (? running tasks) into Spark UI progress
Eric Vandenberg created SPARK-21155: --- Summary: Add (? running tasks) into Spark UI progress Key: SPARK-21155 URL: https://issues.apache.org/jira/browse/SPARK-21155 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.1.1 Reporter: Eric Vandenberg Priority: Minor Fix For: 2.3.0 The progress UI for Active Jobs / Tasks should show the exact number of running tasks. See the screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
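For reference, the visible change is just the text rendered in the progress cell; a hedged sketch of the label formatting (assumed format, not the exact UIUtils markup):
{code}
object ProgressLabelSketch {
  // Builds the label shown in the progress bar, e.g. "400/1000 (3 running, 10 skipped)".
  def progressLabel(completed: Int, total: Int, running: Int, skipped: Int): String = {
    val extras = Seq(
      if (running > 0) Some(s"$running running") else None,
      if (skipped > 0) Some(s"$skipped skipped") else None).flatten
    if (extras.isEmpty) s"$completed/$total" else s"$completed/$total (${extras.mkString(", ")})"
  }

  def main(args: Array[String]): Unit = {
    println(progressLabel(completed = 400, total = 1000, running = 3, skipped = 10))
    println(progressLabel(completed = 1000, total = 1000, running = 0, skipped = 0))
  }
}
{code}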
[jira] [Created] (SPARK-20778) Implement array_intersect function
Eric Vandenberg created SPARK-20778: --- Summary: Implement array_intersect function Key: SPARK-20778 URL: https://issues.apache.org/jira/browse/SPARK-20778 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Eric Vandenberg Priority: Minor Implement an array_intersect function that takes array arguments and returns an array containing all elements of the first array that are common to the remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
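To pin down the intended semantics, here is a small standalone Scala sketch of the proposed behavior (illustrative only; the function did not exist in Spark SQL at the time of this issue, and dropping duplicates from the first array is an assumption, not something the description specifies):
{code}
object ArrayIntersectSketch {
  // Keep the elements of the first array that appear in every remaining array,
  // preserving the first array's order. De-duplication is an assumed detail.
  def arrayIntersect[T](first: Seq[T], rest: Seq[T]*): Seq[T] = {
    val others = rest.map(_.toSet)
    first.distinct.filter(e => others.forall(_.contains(e)))
  }

  def main(args: Array[String]): Unit = {
    println(arrayIntersect(Seq(1, 2, 3, 4), Seq(2, 4, 6), Seq(4, 2))) // List(2, 4)
    println(arrayIntersect(Seq("a", "b"), Seq("c")))                  // List()
  }
}
{code}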