[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749592#comment-15749592 ]
Dmitry Buzolin commented on SPARK-18085:
----------------------------------------

I meant to say the discussion was becoming unproductive, since you have your own definition of orthogonality... Here is my definition. Below is the class histogram from a heap dump of the SHS process, repeated every 30 seconds while I clicked a link on an SHS application page. Do you see how those char[] objects accumulate in the heap until the OOM happens? During this time my browser was "hanging" on the HTTP response.

 num     #instances         #bytes  class name
----------------------------------------------
   1:      13075420      578500976  [C
   1:      15799820      653388056  [C
   1:      21342880     1117613800  [C
   1:      23314556     1065313544  [C
   1:      30900112     1380367768  [C
   1:      43923118     1974655888  [C
   1:      45056919     1635108368  [C
   1:      49365245     1867236600  [C
   1:      50455326     1894170920  [C
   1:      53344480     1925798464  [C
   1:      55918048     2013593472  [C
   1:      57219355     2113012528  [C
   1:      61683961     2219073304  [C
   1:      64389451     2312154896  [C

Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
        at org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:204)
        at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:79)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
        at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)

   1:        737101       83175064  [C
   1:       2631037      463742576  [C
   1:       2305651      408542248  [C

Exactly the same behaviour happens when I make a REST call:

curl ... http://shs_node:18088/api/v1/applications/application_1479223266604_3123/executors

So, yes, you do store the JSON (or UI) response in SHS memory. And yes, JSON is not an efficient storage format for logs, because about 70% of the data is repeated key names. While I agree that JSON per se is not the root cause of this behaviour (one could have the same problem with CSV or any other format), it quickly magnifies the issue because every record stores the key names again.

In addition to my suggestions above, I would propose configurable logging levels: for example, if we don't want to log Task details, there should be an option to turn that off. Also, when a request is made for Executor details, it shouldn't hang the way it does now; it should instead return a response indicating that the Job/Task aggregation information is not available. I can only guess at the root cause, but I have noticed this always happens when an application spawns thousands to hundreds of thousands of tasks.

So we do need a more intelligent, more configurable SHS logging facility, one that doesn't consume too much of the cluster's resources to perform aggregation. Good luck building a better SHS!

> Better History Server scalability for many / large applications
> ----------------------------------------------------------------
>
>                 Key: SPARK-18085
>                 URL: https://issues.apache.org/jira/browse/SPARK-18085
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>    Affects Versions: 2.0.0
>            Reporter: Marcelo Vanzin
>        Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues
> when serving lots of applications, and when serving large applications.
>
> I'm filing this umbrella to track work related to addressing those issues.
> I'll be attaching a document shortly describing the issues and suggesting a
> path to how to solve them.
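
A note on the scala.MatchError in the stack trace above: a MatchError is what a non-exhaustive Scala match produces when it is handed a value none of its cases cover, here the OutOfMemoryError, so the real failure gets masked. The following is a minimal, self-contained sketch of that failure mode; it is not the actual HistoryServer.loadAppUi code, only an illustration of the pattern.

object MatchErrorSketch {
  // A handler that only anticipates Exception subtypes. Because java.lang.Error
  // is not covered, a fatal OutOfMemoryError falls through and surfaces as a
  // scala.MatchError instead of the real cause.
  def classify(t: Throwable): String = t match {
    case _: java.util.NoSuchElementException => "application not found"
    case e: Exception                        => s"load failed: ${e.getMessage}"
  }

  def main(args: Array[String]): Unit = {
    println(classify(new java.util.NoSuchElementException("no such app")))
    try {
      classify(new OutOfMemoryError("GC overhead limit exceeded"))
    } catch {
      case m: scala.MatchError =>
        // This is the confusing wrapper error seen in the trace above.
        println(s"caught $m instead of the underlying OutOfMemoryError")
    }
  }
}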
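
On the point that most of a JSON log line is repeated key names, here is a rough back-of-the-envelope sketch. The field names below are invented for the example and are not the exact event-log schema; the actual percentage depends on the event type and value sizes.

object JsonKeyOverheadSketch {
  def main(args: Array[String]): Unit = {
    // Invented, illustrative field names; not the real Spark event-log schema.
    val fields = Seq(
      "Event"       -> "\"SparkListenerTaskEnd\"",
      "Task ID"     -> "4242",
      "Stage ID"    -> "7",
      "Launch Time" -> "1480000000000",
      "Finish Time" -> "1480000000123",
      "Executor ID" -> "\"12\"",
      "Host"        -> "\"node-01\""
    )

    val line = fields.map { case (k, v) => s""""$k": $v""" }.mkString("{", ", ", "}")

    // Bytes spent on key names plus their quotes, colon and space.
    val keyBytes = fields.map { case (k, _) => k.length + 4 }.sum

    println(line)
    println(f"key overhead: ${100.0 * keyBytes / line.length}%.0f%% of this line")
  }
}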
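
And one possible shape for the "don't log Task details" switch suggested above. This is purely a sketch: the config key spark.history.sketch.logTaskDetails is made up and Spark does not currently expose such a setting; the only real APIs used are SparkConf and the public SparkListener callbacks.

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent, SparkListenerTaskEnd, SparkListenerTaskStart}

// Sketch only: wraps another listener (for example, whatever writes the event
// log) and drops per-task events when a hypothetical config key is disabled.
class TaskFilteringListener(conf: SparkConf, underlying: SparkListener) extends SparkListener {

  // Hypothetical setting; not an existing Spark configuration.
  private val logTaskDetails = conf.getBoolean("spark.history.sketch.logTaskDetails", true)

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit =
    if (logTaskDetails) underlying.onTaskStart(taskStart)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    if (logTaskDetails) underlying.onTaskEnd(taskEnd)

  // Custom events are forwarded unchanged; the remaining stage/job callbacks
  // would be delegated the same way and are elided here for brevity.
  override def onOtherEvent(event: SparkListenerEvent): Unit =
    underlying.onOtherEvent(event)
}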