[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971594#comment-15971594 ] Marcelo Vanzin edited comment on SPARK-18085 at 4/17/17 8:57 PM:

I'm getting close to a point where I think the code can start to trickle in. I want to wait until 2.2's branch gets going before sending PRs, though. In the meantime, I'm keeping "private PRs" in my fork for each milestone, so it's easy for anybody interested in getting familiar with the code to provide comments: https://github.com/vanzin/spark/pulls

At this point, all the UI state that the SHS shows is kept in a disk store (that covers core + SQL, but not streaming). Since streaming is not shown in the SHS, I'm not planning to touch it (aside from the small changes I made that were required by internal API changes in core).

What's left at this point is, from my view:
- managing disk space in the SHS so that a large number of apps doesn't cause the SHS to fill local disks
- limiting the number of jobs / stages / tasks / etc. kept in the store (similar to existing settings, which the code doesn't yet honor)
- an in-memory implementation of the store (in case someone wants lower latency or can't / doesn't want to use the disk store)
- more tests, and more testing

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
> It's a known fact that the History Server currently has some annoying issues
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues.
> I'll be attaching a document shortly describing the issues and suggesting a
> path to how to solve them.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
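The disk-store approach described above (parse event logs once, persist the derived UI state locally, and serve lookups without re-reading logs) can be illustrated with a toy sketch. This is not Spark's actual store API: Python's `shelve` stands in for the real key-value backend, and the `DiskAppStore` class, its key layout, and the record shapes are all hypothetical.

```python
import os
import shelve
import tempfile

# Toy stand-in for an SHS-style disk store: derived UI state is written
# once and later served from local disk instead of being held in memory.
class DiskAppStore:
    def __init__(self, path):
        self._db = shelve.open(path)

    def write(self, kind, app_id, record):
        # Key layout "<kind>/<app_id>" is illustrative only.
        self._db[f"{kind}/{app_id}"] = record

    def read(self, kind, app_id):
        return self._db[f"{kind}/{app_id}"]

    def close(self):
        self._db.close()

path = os.path.join(tempfile.mkdtemp(), "shs-store")
store = DiskAppStore(path)
store.write("appinfo", "app-001", {"name": "demo", "completed": True})
print(store.read("appinfo", "app-001")["name"])  # prints: demo
store.close()
```

An in-memory implementation of the same interface (a plain dict behind `write`/`read`) would be the drop-in alternative mentioned in the last bullet above, for users who want lower latency or can't use local disk.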
[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749660#comment-15749660 ] Marcelo Vanzin edited comment on SPARK-18085 at 12/14/16 10:14 PM:

Yes. The REST API returns JSON. That JSON is not read from the event log. The Spark UI also returns large HTML files generated in memory. Yes. There might be enhancements that can be made to that part of the code. But, for the 100th time, that is not what this work is about. If you care about those, *open a new bug*. It's not hard.
[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15725880#comment-15725880 ] Dmitry Buzolin edited comment on SPARK-18085 at 12/6/16 3:58 PM:

Spark log size depends directly on a few things:
- the underlying schema-less data format being used (JSON)
- the current logging implementation, where the log size is directly proportional to the number of tasks

Since the SHS keeps this data in memory, I don't see how these issues are orthogonal to the memory issues in the SHS; in my opinion, they are causing them. JSON is great as a data interchange or configuration format, and it's good for small payloads, but using it for logging? This is the first time I've seen that. I understand you may not change this, but it's worth keeping in mind. Thank you.
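The concern above (that a schema-less format makes every per-task record carry its own field names) can be made concrete with a rough, self-contained estimate. The record below only mimics the shape of an event-log line; it is not Spark's actual event schema, and the helper is purely illustrative.

```python
import json

# Illustrative event-log-style record; field names mimic the shape of
# per-task events but are not Spark's real schema.
event = {
    "Event": "TaskEnd",
    "Stage ID": 3,
    "Task Info": {
        "Task ID": 42,
        "Launch Time": 1480000000000,
        "Executor ID": "1",
        "Host": "worker-1",
    },
}

def key_bytes(obj):
    # Bytes spent on key names (plus their quotes), counted recursively.
    if isinstance(obj, dict):
        return sum(len(k) + 2 + key_bytes(v) for k, v in obj.items())
    if isinstance(obj, list):
        return sum(key_bytes(v) for v in obj)
    return 0

line = json.dumps(event, separators=(",", ":"))
overhead = key_bytes(event) / len(line)
print(f"{overhead:.0%} of this line is key names")
```

Since these key names repeat on every task event, the overhead scales with the task count, which is exactly why schema-less text and per-task records together produce very large logs.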
[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722425#comment-15722425 ] Dmitry Buzolin edited comment on SPARK-18085 at 12/5/16 2:45 PM:

I would like to add my observations after working with the SHS:

1. The JSON format for log storage is inefficient and redundant - about 70% of the information in the logs is repeated key names. This reliance on JSON is a dead end for a distributed architecture like Spark (perhaps compression may alleviate this to some extent), and it would be great if this changed to normal OS-style logging or to storing logs in a database.

2. The amount of logging in Spark is directly proportional to the number of tasks. I've seen 50+ GB log files sitting in HDFS. The design has to be more intelligent than to produce such logs, as they slow down the UI, hurt the performance of the REST API, and can occupy a lot of space in HDFS.

3. The Spark REST API should be consistent with regard to log availability and the information it conveys. Just two examples:
- Many times, when a Spark application finishes, both YARN and Spark report the application as completed via calls to the top-level endpoint - yet the log file is not available via the Spark REST API, which returns a "no such app" message when one queries executor or job details. This leaves one guessing and waiting before querying the status of the application.
- When a Spark app is running, one can clearly see vCores and allocatedMemory for it. However, once the application completes, these parameters are reset to -1. Why? Perhaps to indicate that the application is no longer running and occupying cluster resources. But there are already flags telling us this ("state" and "finalStatus"), so why make it harder to find out how many resources were used by apps that have already completed?
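The hedge in point 1 above, that compression "may alleviate this to some extent", is easy to sanity-check: because the key names repeat on every line, JSON-lines logs are highly redundant and compress very well. The sample data below is synthetic, not a real Spark event log.

```python
import gzip
import json

# Synthetic event log: many task-level records sharing the same key names.
lines = "\n".join(
    json.dumps({"Event": "TaskEnd", "Task ID": i, "Host": f"worker-{i % 4}"})
    for i in range(10_000)
).encode()

compressed = gzip.compress(lines)
ratio = len(lines) / len(compressed)
print(f"raw: {len(lines)} bytes, gzip: {len(compressed)} bytes, ~{ratio:.0f}x")
```

Compression shrinks the stored bytes, though it does nothing for the parsing cost: every record must still be decompressed and deserialized before the SHS can use it.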