[jira] [Updated] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Buzolin updated SPARK-18085:

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
> It's a known fact that the History Server currently has some annoying issues when serving lots of applications, and when serving large applications. I'm filing this umbrella to track work related to addressing those issues. I'll be attaching a document shortly describing the issues and suggesting a path to how to solve them.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749652#comment-15749652 ] Dmitry Buzolin commented on SPARK-18085:

I don't think you read my response attentively enough (I said REST API call as well as UI). The REST API returns JSON, doesn't it? And where is that JSON kept before it is delivered to the client making the REST call? Not in thin air, but in the memory of the SHS. I have nothing to add at this point, but I feel discouraged from continuing this discussion. Thanks, Dmitry.
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749592#comment-15749592 ] Dmitry Buzolin commented on SPARK-18085:

I meant to say the discussion was becoming unproductive, since you have your own definition of orthogonality... Here is mine. Below is a class histogram of the SHS process heap, repeated every 30 seconds while I clicked a link on an SHS application page. Do you see how those char[] objects are kept on the heap until an OOM happens? During this time my browser was "hanging" on the HTTP response.

num     #instances      #bytes          class name
--
1:      13075420        578500976       [C
1:      15799820        653388056       [C
1:      21342880        1117613800      [C
1:      23314556        1065313544      [C
1:      30900112        1380367768      [C
1:      43923118        1974655888      [C
1:      45056919        1635108368      [C
1:      49365245        1867236600      [C
1:      50455326        1894170920      [C
1:      53344480        1925798464      [C
1:      55918048        2013593472      [C
1:      57219355        2113012528      [C
1:      61683961        2219073304      [C
1:      64389451        2312154896      [C

Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
        at org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:204)
        at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:79)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
        at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)

1:      737101          83175064        [C
1:      2631037         463742576       [C
1:      2305651         408542248       [C

Exactly the same behaviour occurs when I make a REST call:

curl ... http://shs_node:18088/api/v1/applications/application_1479223266604_3123/executors

So yes, you do store the JSON (or UI response) in SHS memory. And yes, JSON is not an efficient storage format for logs, because about 70% of the data is repeated key names.
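A histogram like the one above is typically captured with `jmap -histo <pid>`. As a minimal sketch (the parser below is a hypothetical helper, not any official tool; the sample lines reuse values from the dump above), one can parse such lines and track the growth of `[C` (char[]) bytes between snapshots:

```python
# Parse `jmap -histo`-style lines ("rank: #instances #bytes class-name")
# and report the growth of char[] ([C) bytes between snapshots.

def parse_histo_line(line):
    """Split 'rank: instances bytes class-name' into (instances, bytes, name)."""
    parts = line.split()
    return int(parts[1]), int(parts[2]), parts[3]

# Snapshot lines taken from the histogram quoted above.
snapshots = [
    "1: 13075420 578500976 [C",
    "1: 15799820 653388056 [C",
    "1: 21342880 1117613800 [C",
]

prev = None
for line in snapshots:
    _, nbytes, name = parse_histo_line(line)
    delta = "" if prev is None else f" (+{nbytes - prev:,} bytes since last snapshot)"
    print(f"{name}: {nbytes:,} bytes{delta}")
    prev = nbytes
```

A monotonically growing `[C` total across snapshots, as in the dump above, is the classic signature of strings being accumulated faster than they are released.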
While I agree that JSON per se is not the root cause of this behaviour (one could have the same problem with CSV or any other format), it quickly magnifies the issue because it redundantly stores key names. In addition to my suggestions above, I would propose configurable logging levels: for example, if we don't want to log task details, we should have an option to turn that off. Also, when a request is made for executor details, it shouldn't hang the way it does now; it should instead return a response indicating that job/task aggregation information is not available. I can only guess at the root cause, but I have noticed this always happens when thousands to hundreds of thousands of tasks are spawned during application execution. So we do need a more intelligent, more configurable SHS logging facility that doesn't consume too many cluster resources to perform aggregation. Good luck building a better SHS!
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727446#comment-15727446 ] Dmitry Buzolin commented on SPARK-18085:

I posted my comments not to start an endless flame war about what is orthogonal and what is not. It is up to you how to use them. I speak from my experience running Spark clusters of substantial size. If you think offloading the problem from memory to disk storage is the way to go, do it. I'd be happy to see SHS performance improvements in the next Spark release.
[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725880#comment-15725880 ] Dmitry Buzolin edited comment on SPARK-18085 at 12/6/16 3:58 PM:

The size of a Spark log depends directly on a few things:
- the underlying schema-less data format being used (JSON)
- the current logging implementation, where the log size is directly proportional to the number of tasks

Since the SHS keeps this data in memory, I don't see how these issues are orthogonal to the SHS memory issues; in my opinion, they are causing them. JSON is great as a data interchange or configuration format, and it's good for small payloads, but using it for logging? This is the first time I have seen that. I understand you may not change this, but it is worth keeping in mind. Thank you.
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725880#comment-15725880 ] Dmitry Buzolin commented on SPARK-18085:

The size of a Spark log depends directly on a few things:
- the underlying schema-less data format being used (JSON)
- the current logging implementation, where the log size is directly proportional to the number of tasks

Since the SHS keeps this data in memory, I don't see how these issues are orthogonal to the SHS memory issues; in my opinion, they are causing them. JSON is great as a data interchange or configuration format and is good for small payloads, but using it for logging is something I have honestly never seen in my last 20 years in IT. I understand you may not change this, but it is worth keeping in mind. Thank you.
[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722425#comment-15722425 ] Dmitry Buzolin edited comment on SPARK-18085 at 12/5/16 2:45 PM:

I would like to add my observations after working with the SHS:

1. The JSON format for log storage is inefficient and redundant: about 70% of the information in the logs is repeated key names. This reliance on JSON is a dead end for a distributed architecture like Spark (perhaps compression can alleviate it to some extent), and it would be great if this changed to normal OS-style logging or to storing logs in a database.

2. The amount of logging in Spark is directly proportional to the number of tasks. I've seen 50+ GB log files sitting in HDFS. The design has to be intelligent enough not to produce such logs, as they slow down the UI, hurt the performance of the REST API, and can occupy a lot of space in HDFS.

3. The Spark REST API should be consistent with regard to log availability and the information it conveys. Just two examples:
- Many times, when a Spark application finishes and both YARN and Spark report it as completed via the top-level endpoint, the log file is still not available via the Spark REST API, which returns a "no such app" message when one queries executor or job details. This leaves one guessing and waiting before querying the status of the application.
- While a Spark app is running, one can clearly see vCores and allocatedMemory for it. Once the application completes, however, these parameters are reset to -1. Why? Perhaps to indicate that the application is no longer running and occupying cluster resources. But there are already flags telling us that ("state" and "finalStatus"), so why make it harder to find out how many resources were used by apps that have already completed?
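The "no such app" race described in point 3 is usually worked around client-side with a retry loop. This is a sketch under stated assumptions: `fetch` is an injected callable supplied by whatever HTTP client you use (it is not part of any Spark API), and the endpoint path follows the `/api/v1/applications/<app-id>/executors` call shown elsewhere in this thread.

```python
import time

def wait_for_app(fetch, app_id, retries=5, delay=2.0):
    """Poll the history server until executor data for `app_id` is served.

    `fetch(path)` must return the decoded JSON body, or raise LookupError
    when the server answers with its "no such app" message.
    """
    path = f"/api/v1/applications/{app_id}/executors"
    last_err = None
    for _ in range(retries):
        try:
            return fetch(path)
        except LookupError as err:
            last_err = err
            time.sleep(delay)
    raise TimeoutError(f"{app_id} still not served after {retries} attempts") from last_err
```

A nicer fix, as the comment suggests, would be for the server itself to distinguish "unknown application" from "known but not yet loaded", so clients would not have to guess.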
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722425#comment-15722425 ] Dmitry Buzolin commented on SPARK-18085:

I would like to add my observations after working with the SHS:

1. The JSON format for log storage is inefficient and redundant: about 70% of the information in the logs is repeated key names. This reliance on JSON is a dead end for a distributed architecture like Spark (perhaps compression can alleviate it to some extent), and it would be great if this changed to normal OS-style logging or to storing logs in a database.

2. The amount of logging in Spark is directly proportional to the number of tasks. I've seen 50+ GB log files sitting in HDFS. The design has to be intelligent enough not to produce such logs, as they slow down the UI, hurt the performance of the REST API, and can occupy a lot of space in HDFS.

3. The Spark REST API should be consistent with regard to log availability. Many times, when a Spark application finishes and both YARN and Spark report it as completed via the top-level endpoint, the log file is still not available via the Spark REST API, which returns a "no such app" message when one queries executor or job details. This leaves one guessing and waiting before querying the status of the application.