[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749592#comment-15749592 ]
Dmitry Buzolin commented on SPARK-18085:
----------------------------------------

I meant to say the discussion was becoming unproductive, since you have your own definition of orthogonality... Here is my definition. Below is the class histogram from a heap dump of the SHS process, repeated every 30 seconds while I clicked a link on an SHS application page. Do you see how those char[] objects accumulate in the heap until the OOM happens? During this time my browser was "hanging" on the HTTP response.

 num     #instances         #bytes  class name
----------------------------------------------
   1:      13075420      578500976  [C
   1:      15799820      653388056  [C
   1:      21342880     1117613800  [C
   1:      23314556     1065313544  [C
   1:      30900112     1380367768  [C
   1:      43923118     1974655888  [C
   1:      45056919     1635108368  [C
   1:      49365245     1867236600  [C
   1:      50455326     1894170920  [C
   1:      53344480     1925798464  [C
   1:      55918048     2013593472  [C
   1:      57219355     2113012528  [C
   1:      61683961     2219073304  [C
   1:      64389451     2312154896  [C

Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
        at org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:204)
        at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:79)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
        at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)

   1:        737101       83175064  [C
   1:       2631037      463742576  [C
   1:       2305651      408542248  [C

Exactly the same behaviour happens when I make a REST call:

curl ... http://shs_node:18088/api/v1/applications/application_1479223266604_3123/executors

So, yes, you do store the JSON (or UI) response in SHS memory. And yes, JSON is not an efficient storage format for logs, because about 70% of the data is repeated key names. While I agree that JSON per se is not the root cause of this behaviour (one could have the same problem with CSV or any other format), it quickly magnifies the issue because every record stores the key names again.

In addition to my suggestions above, I would propose configurable logging levels: for example, if we don't want to log Task details, there should be an option to turn that off. Also, when a request is made for Executor details, it shouldn't hang the way it does now; it should instead return a response indicating that the Job/Task aggregation information is not available. I can only guess at the root cause, but I have noticed this always happens when an application spawns thousands to hundreds of thousands of tasks.

So we do need a more intelligent, more configurable SHS logging facility, one that doesn't consume too much of the cluster's resources to perform aggregation. Good luck building a better SHS!

> Better History Server scalability for many / large applications
> ----------------------------------------------------------------
>
>                 Key: SPARK-18085
>                 URL: https://issues.apache.org/jira/browse/SPARK-18085
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>    Affects Versions: 2.0.0
>            Reporter: Marcelo Vanzin
>        Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues
> when serving lots of applications, and when serving large applications.
>
> I'm filing this umbrella to track work related to addressing those issues.
> I'll be attaching a document shortly describing the issues and suggesting a
> path to how to solve them.
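
A note on the scala.MatchError in the stack trace above: a MatchError is what a non-exhaustive Scala match produces when it is handed a value none of its cases cover, here the OutOfMemoryError, so the real failure gets masked. The following is a minimal, self-contained sketch of that failure mode; it is not the actual HistoryServer.loadAppUi code, only an illustration of the pattern.

object MatchErrorSketch {
  // A handler that only anticipates Exception subtypes. Because java.lang.Error
  // is not covered, a fatal OutOfMemoryError falls through and surfaces as a
  // scala.MatchError instead of the real cause.
  def classify(t: Throwable): String = t match {
    case _: java.util.NoSuchElementException => "application not found"
    case e: Exception                        => s"load failed: ${e.getMessage}"
  }

  def main(args: Array[String]): Unit = {
    println(classify(new java.util.NoSuchElementException("no such app")))
    try {
      classify(new OutOfMemoryError("GC overhead limit exceeded"))
    } catch {
      case m: scala.MatchError =>
        // This is the confusing wrapper error seen in the trace above.
        println(s"caught $m instead of the underlying OutOfMemoryError")
    }
  }
}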
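
On the point that most of a JSON log line is repeated key names, here is a rough back-of-the-envelope sketch. The field names below are invented for the example and are not the exact event-log schema; the actual percentage depends on the event type and value sizes.

object JsonKeyOverheadSketch {
  def main(args: Array[String]): Unit = {
    // Invented, illustrative field names; not the real Spark event-log schema.
    val fields = Seq(
      "Event"       -> "\"SparkListenerTaskEnd\"",
      "Task ID"     -> "4242",
      "Stage ID"    -> "7",
      "Launch Time" -> "1480000000000",
      "Finish Time" -> "1480000000123",
      "Executor ID" -> "\"12\"",
      "Host"        -> "\"node-01\""
    )

    val line = fields.map { case (k, v) => s""""$k": $v""" }.mkString("{", ", ", "}")

    // Bytes spent on key names plus their quotes, colon and space.
    val keyBytes = fields.map { case (k, _) => k.length + 4 }.sum

    println(line)
    println(f"key overhead: ${100.0 * keyBytes / line.length}%.0f%% of this line")
  }
}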
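
And one possible shape for the "don't log Task details" switch suggested above. This is purely a sketch: the config key spark.history.sketch.logTaskDetails is made up and Spark does not currently expose such a setting; the only real APIs used are SparkConf and the public SparkListener callbacks.

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent, SparkListenerTaskEnd, SparkListenerTaskStart}

// Sketch only: wraps another listener (for example, whatever writes the event
// log) and drops per-task events when a hypothetical config key is disabled.
class TaskFilteringListener(conf: SparkConf, underlying: SparkListener) extends SparkListener {

  // Hypothetical setting; not an existing Spark configuration.
  private val logTaskDetails = conf.getBoolean("spark.history.sketch.logTaskDetails", true)

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit =
    if (logTaskDetails) underlying.onTaskStart(taskStart)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    if (logTaskDetails) underlying.onTaskEnd(taskEnd)

  // Custom events are forwarded unchanged; the remaining stage/job callbacks
  // would be delegated the same way and are elided here for brevity.
  override def onOtherEvent(event: SparkListenerEvent): Unit =
    underlying.onOtherEvent(event)
}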