[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-12-12 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288298#comment-16288298
 ] 

Eric Vandenberg commented on SPARK-21867:
-

1. The default would be 1, so it does not change default behavior.  It is currently 
configurable to set this higher (spark.shuffle.async.num.sorter=2).
2. Yes, the number of spill files could increase; that's one reason this is not 
on by default.  This could be an issue if it hits file system limits, etc. in 
extreme cases.  For the jobs we've tested, this wasn't a problem.  We think 
this improvement has the biggest impact on larger jobs (we've seen CPU reduction 
of ~30% in some large jobs); it may not help as much for smaller jobs with fewer 
spills.
3. When a sorter hits the threshold, it will kick off an asynchronous spill and 
then continue inserting into another sorter (assuming one is available).  It 
could make sense to raise the threshold; this would result in larger spill 
files.  There is some risk that raising it too high causes an OOM and then it 
needs to be lowered again.  I'm thinking the algorithm could be improved by 
more accurately calculating and enforcing the threshold based on available 
memory over time; however, doing this would require exposing some memory 
allocation metrics not currently available (in the memory manager), so I opted 
not to do that for now.
4. Yes, too many open files/buffers could be an issue.  So for now this is 
something to look at enabling case by case as part of performance tuning.
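For reference, opting in would just be a normal SparkConf override of the setting 
proposed here (spark.shuffle.async.num.sorter comes from this proposal, not from an 
existing Spark release):

{code}
import org.apache.spark.SparkConf

// Proposed setting (default 1 = current behavior); 2 lets one in-memory sorter
// keep accepting records while another one is spilled asynchronously.
val conf = new SparkConf()
  .set("spark.shuffle.async.num.sorter", "2")
{code}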


> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
> Attachments: Async ShuffleExternalSorter.pdf
>
>
> Currently, Spark tasks are single-threaded. But we see that we could greatly 
> improve the performance of jobs if we can multi-thread some parts of them. 
> For example, profiling our map tasks, which read large amounts of data from 
> HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling 
> the majority of the time. Since both of these operations are IO intensive, the 
> average CPU consumption during the map phase is significantly low. In theory, 
> both the HDFS read and the spilling can be done in parallel if we had additional 
> memory to store data read from HDFS while we are spilling the last batch read.
> Let's say we have 1G of shuffle memory available per task. Currently, in the 
> case of a map task, it reads from HDFS and the records are stored in the 
> available memory buffer. Once we hit the memory limit and there is no more space 
> to store the records, we sort and spill the content to disk. While we are 
> spilling to disk, since we do not have any available memory, we cannot read 
> from HDFS concurrently. 
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we 
> can keep reading from HDFS while the sort and spill happen 
> asynchronously.  Let's say the total 1G of shuffle memory is split into 
> two regions - an active region and a spilling region - each of size 500 MB. We 
> start by reading from HDFS and filling the active region. Once we hit the 
> limit of the active region, we issue an asynchronous spill while flipping the 
> active region and the spilling region. While the spill is happening 
> asynchronously, we still have 500 MB of memory available to read data 
> from HDFS. This way we can amortize the high disk/network IO cost of 
> spilling.
> We made a prototype hack to implement this feature, and our map 
> tasks were as much as 40% faster. 
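A schematic sketch of the double-buffered scheme described above (hypothetical code, not 
the actual ShuffleExternalSorter changes): one region fills while the previous one spills 
on a background thread, and the regions flip when the active one is full.  The region 
size is counted in records here only to keep the example short.

{code}
import java.util.concurrent.Executors
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

class DoubleBufferedSpiller(regionSize: Int)(spillFn: Seq[Array[Byte]] => Unit) {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
  private var active = new ArrayBuffer[Array[Byte]]()
  private var inFlight: Future[Unit] = Future.successful(())

  def insert(record: Array[Byte]): Unit = {
    active += record
    if (active.size >= regionSize) {
      // Block only if the previous asynchronous spill has not finished yet.
      Await.ready(inFlight, Duration.Inf)
      val spilling = active
      active = new ArrayBuffer[Array[Byte]]()  // flip: keep inserting into the fresh region
      inFlight = Future(spillFn(spilling))     // spill the full region in the background
    }
  }

  def close(): Unit = {
    Await.ready(inFlight, Duration.Inf)
    if (active.nonEmpty) spillFn(active)
  }
}
{code}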






[jira] [Updated] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-12-11 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21867:

Attachment: Async ShuffleExternalSorter.pdf

Here is a design proposal to implement this performance improvement.  Will 
submit PR shortly.

> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
> Attachments: Async ShuffleExternalSorter.pdf
>
>
> Currently, Spark tasks are single-threaded. But we see that we could greatly 
> improve the performance of jobs if we can multi-thread some parts of them. 
> For example, profiling our map tasks, which read large amounts of data from 
> HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling 
> the majority of the time. Since both of these operations are IO intensive, the 
> average CPU consumption during the map phase is significantly low. In theory, 
> both the HDFS read and the spilling can be done in parallel if we had additional 
> memory to store data read from HDFS while we are spilling the last batch read.
> Let's say we have 1G of shuffle memory available per task. Currently, in the 
> case of a map task, it reads from HDFS and the records are stored in the 
> available memory buffer. Once we hit the memory limit and there is no more space 
> to store the records, we sort and spill the content to disk. While we are 
> spilling to disk, since we do not have any available memory, we cannot read 
> from HDFS concurrently. 
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we 
> can keep reading from HDFS while the sort and spill happen 
> asynchronously.  Let's say the total 1G of shuffle memory is split into 
> two regions - an active region and a spilling region - each of size 500 MB. We 
> start by reading from HDFS and filling the active region. Once we hit the 
> limit of the active region, we issue an asynchronous spill while flipping the 
> active region and the spilling region. While the spill is happening 
> asynchronously, we still have 500 MB of memory available to read data 
> from HDFS. This way we can amortize the high disk/network IO cost of 
> spilling.
> We made a prototype hack to implement this feature, and our map 
> tasks were as much as 40% faster. 






[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

2017-09-27 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16183117#comment-16183117
 ] 

Eric Vandenberg commented on SPARK-22077:
-

We are using a Facebook-internal cluster scheduler.  I don't think the issue is 
the cluster manager; the actual java.net.URI call returns null for host, name, 
etc. (see the description above), so it is immediately throwing at this point.  
Are you saying the example above *does* parse correctly for you?  If so, then I 
wonder if the discrepancy is due to different JDK versions. 

According to the java.net.URI documentation 
(https://docs.oracle.com/javase/7/docs/api/java/net/URI.html), URIs support IPv6 
as defined in RFC 2732 (http://www.ietf.org/rfc/rfc2732.txt), which requires the 
IPv6 address to be enclosed in square brackets [], for example: 
http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.html
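For reference, a quick sketch (not from the ticket) of the java.net.URI behavior in 
question - the unbracketed form from our logs versus the RFC 2732 bracketed form:

{code}
import java.net.URI

// Unbracketed IPv6 authority: URI falls back to a registry-based authority, so no
// host/port is extracted, which is what RpcEndpointAddress then rejects.
val bad = new URI("spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243")
println(s"host=${bad.getHost} port=${bad.getPort}")    // host=null port=-1

// RFC 2732 form with the address in square brackets parses as a server authority.
val good = new URI("spark://HeartbeatReceiver@[2401:db00:2111:40a1:face:0:21:0]:35243")
println(s"host=${good.getHost} port=${good.getPort}")  // host=[2401:db00:2111:40a1:face:0:21:0] port=35243
{code}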

Thanks,
Eric


> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> -
>
> Key: SPARK-22077
> URL: https://issues.apache.org/jira/browse/SPARK-22077
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Eric Vandenberg
>Priority: Minor
>
> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> For example, 
> sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243"
> is parsed as:
> host = null
> port = -1
> name = null
> While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly.
> This is happening on our production machines and causing spark to not start 
> up.
> org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243
>   at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>   at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
>   at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
>   at 
> org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:507)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825)






[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

2017-09-21 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175479#comment-16175479
 ] 

Eric Vandenberg commented on SPARK-22077:
-

Yes, it worked when I overrode it with "localhost", so it is just a parsing 
issue.  However, this is the default hostname in our configuration, so the 
parsing should be more flexible for IPv6 addresses.

> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> -
>
> Key: SPARK-22077
> URL: https://issues.apache.org/jira/browse/SPARK-22077
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Eric Vandenberg
>Priority: Minor
>
> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> For example, 
> sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243"
> is parsed as:
> host = null
> port = -1
> name = null
> While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly.
> This is happening on our production machines and causing spark to not start 
> up.
> org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243
>   at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>   at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
>   at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
>   at 
> org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:507)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825)






[jira] [Created] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

2017-09-20 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-22077:
---

 Summary: RpcEndpointAddress fails to parse spark URL if it is an 
ipv6 address.
 Key: SPARK-22077
 URL: https://issues.apache.org/jira/browse/SPARK-22077
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.0.0
Reporter: Eric Vandenberg
Priority: Minor


RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

For example, 
sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243"

is parsed as:
host = null
port = -1
name = null

While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly.

This is happening on our production machines and causing spark to not start up.

org.apache.spark.SparkException: Invalid Spark URL: 
spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243
at 
org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
at 
org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
at 
org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:507)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825)






[jira] [Commented] (SPARK-21598) Collect usability/events information from Spark History Server

2017-08-03 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113040#comment-16113040
 ] 

Eric Vandenberg commented on SPARK-21598:
-

[~steve_l]  Do you have any input / thoughts here?  The goal here is to collect 
more information than is available in typical metrics.  I would like to 
directly correlate the replay times with other replay activity attributes like 
job size, user impact (i.e., was the user waiting for a response in real time?), 
etc.  This is usability more than operational; this information would make it 
easier to target and measure specific improvements to the Spark history server 
user experience.  We often have internal users who complain about history server 
performance and need a way to directly reference / understand their experience, 
since the Spark history server is critical for our internal debugging.  If 
there's a way to capture this information using metrics alone I would like to 
learn more, but from my understanding they aren't designed to capture this level 
of information.

> Collect usability/events information from Spark History Server
> --
>
> Key: SPARK-21598
> URL: https://issues.apache.org/jira/browse/SPARK-21598
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.2
>Reporter: Eric Vandenberg
>Priority: Minor
>
> The Spark History Server doesn't currently have a way to collect 
> usability/performance information on its main activity, loading/replay of 
> history files.  We'd like to collect this information to monitor, target and 
> measure improvements in the Spark debugging experience (via history server 
> usage).  Once available, these usability events could be analyzed using other 
> analytics tools.
> The event info to collect:
> SparkHistoryReplayEvent(
> logPath: String,
> logCompressionType: String,
> logReplayException: String, // if an error
> logReplayAction: String, // user replay vs. checkForLogs replay
> logCompleteFlag: Boolean,
> logFileSize: Long,
> logFileSizeUncompressed: Long,
> logLastModifiedTimestamp: Long,
> logCreationTimestamp: Long,
> logJobId: Long,
> logNumEvents: Int,
> logNumStages: Int,
> logNumTasks: Int,
> logReplayDurationMillis: Long
> )
> The main Spark engine has a SparkListenerInterface through which all compute 
> engine events are broadcast.  It probably doesn't make sense to reuse this 
> abstraction for broadcasting Spark history server events since the "events" 
> are not related or compatible with one another.  Also note the metrics 
> registry collects history caching metrics but doesn't provide the type of 
> information above.
> The proposal here would be to add some basic event listener infrastructure to 
> capture history server activity events.  This would work similarly to how the 
> SparkListener infrastructure works.  It could be configured in a similar 
> manner, e.g. spark.history.listeners=MyHistoryListenerClass.
> Open to feedback / suggestions / comments on the approach or alternatives.
> cc: [~vanzin] [~cloud_fan] [~ajbozarth] [~jiangxb1987]






[jira] [Created] (SPARK-21598) Collect usability/events information from Spark History Server

2017-08-01 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-21598:
---

 Summary: Collect usability/events information from Spark History 
Server
 Key: SPARK-21598
 URL: https://issues.apache.org/jira/browse/SPARK-21598
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.0.2
Reporter: Eric Vandenberg
Priority: Minor


The Spark History Server doesn't currently have a way to collect 
usability/performance information on its main activity, loading/replay of 
history files.  We'd like to collect this information to monitor, target and 
measure improvements in the Spark debugging experience (via history server 
usage).  Once available, these usability events could be analyzed using other 
analytics tools.

The event info to collect:
SparkHistoryReplayEvent(
logPath: String,
logCompressionType: String,
logReplayException: String, // if an error
logReplayAction: String, // user replay vs. checkForLogs replay
logCompleteFlag: Boolean,
logFileSize: Long,
logFileSizeUncompressed: Long,
logLastModifiedTimestamp: Long,
logCreationTimestamp: Long,
logJobId: Long,
logNumEvents: Int,
logNumStages: Int,
logNumTasks: Int,
logReplayDurationMillis: Long
)

The main Spark engine has a SparkListenerInterface through which all compute 
engine events are broadcast.  It probably doesn't make sense to reuse this 
abstraction for broadcasting Spark history server events since the "events" are 
not related or compatible with one another.  Also note the metrics registry 
collects history caching metrics but doesn't provide the type of information 
above.

The proposal here would be to add some basic event listener infrastructure to 
capture history server activity events.  This would work similarly to how the 
SparkListener infrastructure works.  It could be configured in a similar 
manner, e.g. spark.history.listeners=MyHistoryListenerClass.

Open to feedback / suggestions / comments on the approach or alternatives.

cc: [~vanzin] [~cloud_fan] [~ajbozarth] [~jiangxb1987]
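A minimal sketch of what the proposed hook could look like (names here are hypothetical, 
not an existing Spark API; the event fields are the ones listed above):

{code}
case class SparkHistoryReplayEvent(
  logPath: String,
  logCompressionType: String,
  logReplayException: String,   // if an error
  logReplayAction: String,      // user replay vs. checkForLogs replay
  logCompleteFlag: Boolean,
  logFileSize: Long,
  logFileSizeUncompressed: Long,
  logLastModifiedTimestamp: Long,
  logCreationTimestamp: Long,
  logJobId: Long,
  logNumEvents: Int,
  logNumStages: Int,
  logNumTasks: Int,
  logReplayDurationMillis: Long)

trait SparkHistoryListener {
  def onReplayEvent(event: SparkHistoryReplayEvent): Unit
}

// Example listener, loaded via the proposed spark.history.listeners configuration.
class MyHistoryListenerClass extends SparkHistoryListener {
  override def onReplayEvent(event: SparkHistoryReplayEvent): Unit = {
    // Forward to an external analytics pipeline; here we just log it.
    println(s"replayed ${event.logPath} in ${event.logReplayDurationMillis} ms")
  }
}
{code}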








[jira] [Commented] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.

2017-08-01 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109282#comment-16109282
 ] 

Eric Vandenberg commented on SPARK-21571:
-

Link to pull request https://github.com/apache/spark/pull/18791

> Spark history server leaves incomplete or unreadable history files around 
> forever.
> --
>
> Key: SPARK-21571
> URL: https://issues.apache.org/jira/browse/SPARK-21571
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Eric Vandenberg
>Priority: Minor
>
> We have noticed that history server logs are sometimes never cleaned up.  The 
> current history server logic *ONLY* cleans up history files if they are 
> completed, since in general it doesn't make sense to clean up in-progress 
> history files (after all, the job is presumably still running).  Note that 
> in-progress history files would generally not be targeted for clean up anyway, 
> assuming they regularly flush logs and the file system accurately updates the 
> history log's last modified time/size; while this is likely, it is not 
> guaranteed behavior.
> As a consequence of the current clean-up logic and a combination of unclean 
> shutdowns, various file system bugs, earlier Spark bugs, etc., we have 
> accumulated thousands of these dead history files associated with long since 
> gone jobs.
> For example (with spark.history.fs.cleaner.maxAge=14d):
> -rw-rw   3 xx   ooo  
> 14382 2016-09-13 15:40 
> /user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard
> -rw-rw   3  ooo   
> 5933 2016-11-01 20:16 
> /user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4
> -rw-rw   3 yyy  ooo   
>0 2017-01-19 11:59 
> /user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress
> -rw-rw   3 xooo   
>0 2017-01-19 14:17 
> /user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress
> -rw-rw   3 yyy  ooo   
>0 2017-01-20 10:56 
> /user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress
> -rw-rw   3  ooo  
> 11955 2017-01-20 17:55 
> /user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress
> -rw-rw   3  ooo  
> 11958 2017-01-20 17:55 
> /user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress
> -rw-rw   3  ooo  
> 11960 2017-01-20 17:55 
> /user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress
> Based on the current logic, clean up candidates are skipped in several cases:
> 1. if a file has 0 bytes, it is completely ignored
> 2. if a file is in progress and not parseable / the appID can't be extracted, 
> it is completely ignored
> 3. if a file is complete but not parseable / the appID can't be extracted, it 
> is completely ignored.
> To address this edge case and provide a way to clean out orphaned history 
> files, I propose a new configuration option:
> spark.history.fs.cleaner.aggressive={true, false}, default is false.
> If true, the history server will more aggressively garbage collect history 
> files in cases (1), (2) and (3).  Since the default is false, existing 
> customers won't be affected unless they explicitly opt in.  If customers have 
> similar garbage leaking over time, they have the option of aggressively 
> cleaning it up in such cases.  Also note that aggressive clean up may not be 
> appropriate for some customers if they have long-running jobs that exceed the 
> cleaner.maxAge time frame and/or have buggy file systems.
> Would like to get feedback on whether this seems like a reasonable solution.






[jira] [Updated] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.

2017-07-31 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21571:

Description: 
We have noticed that history server logs are sometimes never cleaned up.  The 
current history server logic *ONLY* cleans up history files if they are 
completed, since in general it doesn't make sense to clean up in-progress 
history files (after all, the job is presumably still running).  Note that 
in-progress history files would generally not be targeted for clean up anyway, 
assuming they regularly flush logs and the file system accurately updates the 
history log's last modified time/size; while this is likely, it is not 
guaranteed behavior.

As a consequence of the current clean-up logic and a combination of unclean 
shutdowns, various file system bugs, earlier Spark bugs, etc., we have 
accumulated thousands of these dead history files associated with long since 
gone jobs.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw   3 xx   ooo  
14382 2016-09-13 15:40 
/user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard
-rw-rw   3  ooo   
5933 2016-11-01 20:16 
/user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4
-rw-rw   3 yyy  ooo 
 0 2017-01-19 11:59 
/user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress
-rw-rw   3 xooo 
 0 2017-01-19 14:17 
/user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress
-rw-rw   3 yyy  ooo 
 0 2017-01-20 10:56 
/user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress
-rw-rw   3  ooo  
11955 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress
-rw-rw   3  ooo  
11958 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress
-rw-rw   3  ooo  
11960 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress

Based on the current logic, clean up candidates are skipped in several cases:
1. if a file has 0 bytes, it is completely ignored
2. if a file is in progress and not parseable / the appID can't be extracted, it 
is completely ignored
3. if a file is complete but not parseable / the appID can't be extracted, it is 
completely ignored.

To address this edge case and provide a way to clean out orphaned history files, 
I propose a new configuration option:

spark.history.fs.cleaner.aggressive={true, false}, default is false.

If true, the history server will more aggressively garbage collect history 
files in cases (1), (2) and (3).  Since the default is false, existing 
customers won't be affected unless they explicitly opt in.  If customers have 
similar garbage leaking over time, they have the option of aggressively cleaning 
it up in such cases.  Also note that aggressive clean up may not be appropriate 
for some customers if they have long-running jobs that exceed the 
cleaner.maxAge time frame and/or have buggy file systems.

Would like to get feedback on whether this seems like a reasonable solution.


  was:
We have noticed that history server logs are sometimes never cleaned up.  The 
current history server logic *ONLY* cleans up history files if they are 
completed since in general it doesn't make sense to clean up inprogress history 
files (after all, the job is presumably still running?)  Note that inprogress 
history files would generally not be targeted for clean up any way assuming 
they regularly flush logs and the file system accurately updates the history 
log last modified time/size, while this is likely it is not guaranteed behavior.

As a consequence of the current clean up logic and a combination of unclean 
shutdowns, various file system bugs, earlier spark bugs, etc. we have 
accumulated thousands of these dead history files associated with long since 
gone jobs.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw   3 xx   ooo  
14382 2016-09-13 15:40 
/user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard
-rw-rw   3  ooo   
5933 2016-11-01 20:16 
/user/

[jira] [Created] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.

2017-07-28 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-21571:
---

 Summary: Spark history server leaves incomplete or unreadable 
history files around forever.
 Key: SPARK-21571
 URL: https://issues.apache.org/jira/browse/SPARK-21571
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Eric Vandenberg
Priority: Minor


We have noticed that history server logs are sometimes never cleaned up.  The 
current history server logic *ONLY* cleans up history files if they are 
completed, since in general it doesn't make sense to clean up in-progress 
history files (after all, the job is presumably still running).  Note that 
in-progress history files would generally not be targeted for clean up anyway, 
assuming they regularly flush logs and the file system accurately updates the 
history log's last modified time/size; while this is likely, it is not 
guaranteed behavior.

As a consequence of the current clean-up logic and a combination of unclean 
shutdowns, various file system bugs, earlier Spark bugs, etc., we have 
accumulated thousands of these dead history files associated with long since 
gone jobs.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw   3 xx   ooo  
14382 2016-09-13 15:40 
/user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard
-rw-rw   3  ooo   
5933 2016-11-01 20:16 
/user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4
-rw-rw   3 yyy  ooo 
 0 2017-01-19 11:59 
/user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress
-rw-rw   3 xooo 
 0 2017-01-19 14:17 
/user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress
-rw-rw   3 yyy  ooo 
 0 2017-01-20 10:56 
/user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress
-rw-rw   3  ooo  
11955 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress
-rw-rw   3  ooo  
11958 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress
-rw-rw   3  ooo  
11960 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress

Based on the current logic, clean up candidates are skipped in several cases:
1. if a file has 0 bytes, it is completely ignored
2. if a file is in progress, it is completely ignored
3. if a file is complete but not parseable, or the appID can't be extracted, it 
is completely ignored.

To address this edge case and provide a way to clean out orphaned history files, 
I propose a new configuration option:

spark.history.fs.cleaner.aggressive={true, false}, default is false.

If true, the history server will more aggressively garbage collect history 
files in cases (1), (2) and (3).  Since the default is false, existing 
customers won't be affected unless they explicitly opt in.  If customers have 
similar garbage leaking over time, they have the option of aggressively cleaning 
it up in such cases.  Also note that aggressive clean up may not be appropriate 
for some customers if they have long-running jobs that exceed the 
cleaner.maxAge time frame and/or have buggy file systems.

Would like to get feedback on whether this seems like a reasonable solution.
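A rough sketch (hypothetical, not the actual FsHistoryProvider code) of how the proposed 
flag would change the clean-up decision for an expired log entry:

{code}
// appId is None when the file can't be parsed / the appID can't be extracted.
case class LogEntry(sizeBytes: Long, inProgress: Boolean, appId: Option[String], lastModified: Long)

def shouldDelete(log: LogEntry, now: Long, maxAgeMs: Long, aggressive: Boolean): Boolean = {
  val expired = now - log.lastModified > maxAgeMs
  val orphaned = log.sizeBytes == 0 || log.appId.isEmpty || log.inProgress  // cases (1), (2), (3)
  if (!expired) false
  else if (orphaned) aggressive  // today these are skipped forever; the new flag opts in to deleting them
  else true
}
{code}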







[jira] [Commented] (SPARK-11170) EOFException on History server reading in progress lz4

2017-07-25 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100537#comment-16100537
 ] 

Eric Vandenberg commented on SPARK-11170:
-

There's a fix for this; see https://issues.apache.org/jira/browse/SPARK-21447 
(it's been reviewed/tested and is pending merge).  

> EOFException on History server reading in progress lz4
> 
>
> Key: SPARK-11170
> URL: https://issues.apache.org/jira/browse/SPARK-11170
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
> Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950)
> Spark: 1.5.x (c27e1904)
>Reporter: Sebastian YEPES FERNANDEZ
>
> The Spark History server is not able to read/save the job history if Spark is 
> configured to use 
> "spark.io.compression.codec=org.apache.spark.io.LZ4CompressionCodec"; it 
> continuously generates the following error:
> {code}
> ERROR 2015-10-16 16:21:39 org.apache.spark.deploy.history.FsHistoryProvider: 
> Exception encountered when attempting to load application log 
> hdfs://DATA/user/spark/his
> tory/application_1444297190346_0073_1.lz4.inprogress
> java.io.EOFException: Stream ended prematurely
> at 
> net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:218)
> at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:150)
> at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:117)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:161)
> at java.io.BufferedReader.readLine(BufferedReader.java:324)
> at java.io.BufferedReader.readLine(BufferedReader.java:389)
> at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> INFO 2015-10-16 16:21:39 org.apache.spark.deploy.history.FsHistoryProvider: 
> Replaying log path: 
> hdfs://DATA/user/spark/history/application_1444297190346_0072_1.lz4.i
> nprogress
> {code}
> As a workaround, setting 
> "spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec" 
> makes the History server work correctly.






[jira] [Created] (SPARK-21447) Spark history server fails to render compressed inprogress history file in some cases.

2017-07-17 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-21447:
---

 Summary: Spark history server fails to render compressed 
inprogress history file in some cases.
 Key: SPARK-21447
 URL: https://issues.apache.org/jira/browse/SPARK-21447
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
 Environment: Spark History Server
Reporter: Eric Vandenberg
Priority: Minor


We've observed that the Spark History Server sometimes fails to load event data 
from a compressed .inprogress Spark history file.  Note the existing logic in 
ReplayListenerBus is to read each line; if it can't JSON-parse the last line 
and the file is in progress (maybeTruncated), the failure is accepted as best 
effort.

In the case of compressed files, the output stream compresses the JSON-serialized 
event data on the fly.  The output is periodically flushed to disk when 
internal buffers are full.  A consequence of that is a partially compressed 
frame may be flushed, and, not being a complete frame, it cannot be 
decompressed.  If the Spark history server attempts to read such an .inprogress 
compressed file, it will throw an EOFException.  This is really analogous to the 
case of failing to JSON-parse the last line in the file (because the full line 
was not flushed); the difference is that since the file is compressed, it is 
possible the compression frame was not flushed, and trying to decompress a 
partial frame fails in a different way that the code doesn't currently handle.

17/07/13 17:24:59 ERROR FsHistoryProvider: Exception encountered when 
attempting to load application log 
hdfs:///user/hadoop/**/spark/logs/job_**-*-*.lz4.inprogress
java.io.EOFException: Stream ended prematurely
at 
org.apache.spark.io.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:230)
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:203)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:66)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:601)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:409)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:310)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
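A rough sketch (hypothetical, simplified from the ReplayListenerBus.replay flow) of the 
handling this implies: when the file may be truncated, an EOFException from the 
decompressing reader is accepted as best effort, the same way a half-written JSON line is 
accepted today.

{code}
import java.io.EOFException

// 'lines' would come from the (possibly LZ4-compressed) event log reader;
// 'maybeTruncated' is true for .inprogress files.
def replayLines(lines: Iterator[String], maybeTruncated: Boolean)(handle: String => Unit): Unit = {
  try {
    lines.foreach(handle)
  } catch {
    case _: EOFException if maybeTruncated =>
      // A partially flushed compression frame in an .inprogress file:
      // stop replay here and accept what was read so far (best effort).
      ()
  }
}
{code}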

 






[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21219:

Description: 
When a task fails, it is (1) added into the pending task list and then (2) the 
corresponding blacklist policy is enforced (i.e., specifying whether it can/can't 
run on a particular node/executor/etc.).  Unfortunately the ordering is such that 
retrying the task could assign it to the same executor, which, incidentally, 
could be shutting down and would immediately fail the retry.  Instead the order 
should be (1) update the blacklist state and then (2) assign the task, ensuring 
that the blacklist policy is properly enforced.

The attached logs demonstrate the race condition.

See spark_executor.log.anon:

1. Task 55.2 fails on the executor

17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
39575)
java.lang.OutOfMemoryError: Java heap space

2. Immediately the same executor is assigned the retry task:

17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

3. The retry task of course fails since the executor is also shutting down due 
to the original task 55.2 OOM failure.

See the spark_driver.log.anon:

The driver processes the lost task 55.2:

17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 java.lang.OutOfMemoryError: Java heap space

The driver then receives the ExecutorLostFailure for the retry task 55.3 
(although it's obfuscated in these logs, the server info is same...)

17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 ExecutorLostFailure (executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0
 exited caused by one of the running tasks) Reason: Remote RPC client 
disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.


  was:
When a task fails it is added into the pending task list and corresponding 
black list policy is enforced (ie, specifying if it can/can't run on a 
particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the black list state should be updated and then the task assigned, ensuring 
that the black list policy is properly enforced.

The attached logs demonstrate the race condition.

See spark_executor.log.anon:

1. Task 55.2 fails on the executor

17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
39575)
java.lang.OutOfMemoryError: Java heap space

2. Immediately the same executor is assigned the retry task:

17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

3. The retry task of course fails since the executor is also shutting down due 
to the original task 55.2 OOM failure.

See the spark_driver.log.anon:

The driver processes the lost task 55.2:

17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 java.lang.OutOfMemoryError: Java heap space

The driver then receives the ExecutorLostFailure for the retry task 55.3 
(although it's obfuscated in these logs, the server info is same...)

17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 ExecutorLostFailure (executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0
 exited caused by one of the running tasks) Reason: Remote RPC client 
disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.



> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> W

[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21219:

Description: 
When a task fails, it is added into the pending task list and the corresponding 
blacklist policy is enforced (i.e., specifying whether it can/can't run on a 
particular node/executor/etc.).  Unfortunately the ordering is such that 
retrying the task could assign it to the same executor, which, incidentally, 
could be shutting down and would immediately fail the retry.  Instead the 
blacklist state should be updated and then the task assigned, ensuring 
that the blacklist policy is properly enforced.

The attached logs demonstrate the race condition.

See spark_executor.log.anon:

1. Task 55.2 fails on the executor

17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
39575)
java.lang.OutOfMemoryError: Java heap space

2. Immediately the same executor is assigned the retry task:

17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

3. The retry task of course fails since the executor is also shutting down due 
to the original task 55.2 OOM failure.

See the spark_driver.log.anon:

The driver processes the lost task 55.2:

17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 java.lang.OutOfMemoryError: Java heap space

The driver then receives the ExecutorLostFailure for the retry task 55.3 
(although it's obfuscated in these logs, the server info is same...)

17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 ExecutorLostFailure (executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0
 exited caused by one of the running tasks) Reason: Remote RPC client 
disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.


  was:
When a task fails it is added into the pending task list and corresponding 
black list policy is enforced (ie, specifying if it can/can't run on a 
particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the black list state should be updated and then the task assigned, ensuring 
that the black list policy is properly enforced.



> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails, it is added into the pending task list and the corresponding 
> blacklist policy is enforced (i.e., specifying whether it can/can't run on a 
> particular node/executor/etc.).  Unfortunately the ordering is such that 
> retrying the task could assign it to the same executor, which, incidentally, 
> could be shutting down and would immediately fail the retry.  Instead 
> the blacklist state should be updated and then the task assigned, ensuring 
> that the blacklist policy is properly enforced.
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
> 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry task of course fails since the executor is also shutting down 
> due to the original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 
> 39575, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3 
> (although it's obfuscated in these logs, the server info is same...)
> 17/06/20 13:25:10 WARN TaskSetManager: Lost 

[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21219:

Attachment: spark_executor.log.anon
spark_driver.log.anon

> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails, it is added into the pending task list and the corresponding 
> blacklist policy is enforced (i.e., specifying whether it can/can't run on a 
> particular node/executor/etc.).  Unfortunately the ordering is such that 
> retrying the task could assign it to the same executor, which, incidentally, 
> could be shutting down and would immediately fail the retry.  Instead 
> the blacklist state should be updated and then the task assigned, ensuring 
> that the blacklist policy is properly enforced.






[jira] [Created] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-21219:
---

 Summary: Task retry occurs on same executor due to race condition 
with blacklisting
 Key: SPARK-21219
 URL: https://issues.apache.org/jira/browse/SPARK-21219
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.1
Reporter: Eric Vandenberg
Priority: Minor


When a task fails, it is added into the pending task list and the corresponding 
blacklist policy is enforced (i.e., specifying whether it can/can't run on a 
particular node/executor/etc.).  Unfortunately the ordering is such that 
retrying the task could assign it to the same executor, which, incidentally, 
could be shutting down and would immediately fail the retry.  Instead the 
blacklist state should be updated and then the task assigned, ensuring that 
the blacklist policy is properly enforced.
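A schematic sketch (hypothetical, not the actual TaskSetManager/blacklist code) of the 
reordering described above: record the blacklist entry for the failed attempt before the 
task is made pending again, so the retry can never be offered to the executor that just 
failed it.

{code}
import scala.collection.mutable

class SchedulerSketch {
  private val blacklist = mutable.Set.empty[(Int, String)]  // (taskIndex, executorId)
  private val pending = mutable.Queue.empty[Int]

  def handleFailedTask(taskIndex: Int, executorId: String): Unit = {
    blacklist += ((taskIndex, executorId))  // (1) enforce the blacklist policy first
    pending.enqueue(taskIndex)              // (2) only then make the task schedulable again
  }

  // An executor asking for work never receives a task it is blacklisted for.
  def offer(executorId: String): Option[Int] =
    pending.dequeueFirst(taskIndex => !blacklist.contains((taskIndex, executorId)))
}
{code}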







[jira] [Commented] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-23 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061225#comment-16061225
 ] 

Eric Vandenberg commented on SPARK-21155:
-

Added a screen shot with skipped tasks for reference.

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks.  See the screen shot attachment for what this looks like.






[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-22 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21155:

Attachment: Screen Shot 2017-06-22 at 9.58.08 AM.png

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks.  See the screen shot attachment for what this looks like.






[jira] [Issue Comment Deleted] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-21 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21155:

Comment: was deleted

(was: Before )

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks.  See the screen shot attachment for what this looks like.






[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-20 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21155:

Attachment: Screen Shot 2017-06-20 at 3.40.39 PM.png

Before 

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks.  See the screen shot attachment for what this looks like.






[jira] [Commented] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-20 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056500#comment-16056500
 ] 

Eric Vandenberg commented on SPARK-21155:
-

The PR is at https://github.com/apache/spark/pull/18369/files

The current view doesn't contain the "(? running tasks)" text in the 
progress UI under Tasks.

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks.  See the screen shot attachment for what this looks like.






[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-20 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21155:

Attachment: Screen Shot 2017-06-20 at 12.32.58 PM.png

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks.  See the screen shot attachment for what this looks like.






[jira] [Created] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-20 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-21155:
---

 Summary: Add (? running tasks) into Spark UI progress
 Key: SPARK-21155
 URL: https://issues.apache.org/jira/browse/SPARK-21155
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.1.1
Reporter: Eric Vandenberg
Priority: Minor
 Fix For: 2.3.0


The progress UI for Active Jobs / Tasks should show the exact number of running 
tasks.  See the screen shot attachment for what this looks like.







[jira] [Created] (SPARK-20778) Implement array_intersect function

2017-05-16 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-20778:
---

 Summary: Implement array_intersect function
 Key: SPARK-20778
 URL: https://issues.apache.org/jira/browse/SPARK-20778
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Eric Vandenberg
Priority: Minor


Implement an array_intersect function that takes array arguments and returns an 
array containing all elements of the first array that are common to the 
remaining arrays.
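A sketch of the intended semantics in plain Scala (not the Catalyst expression itself): 
keep the elements of the first array that appear in every remaining array, preserving the 
first array's order.  Whether duplicates are kept or dropped is an open design choice; 
they are dropped here.

{code}
def arrayIntersect[T](first: Seq[T], rest: Seq[T]*): Seq[T] =
  first.distinct.filter(e => rest.forall(_.contains(e)))

// arrayIntersect(Seq(1, 2, 2, 3, 4), Seq(2, 4, 6), Seq(4, 2))  ==  Seq(2, 4)
{code}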


