[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722425#comment-15722425
 ] 

Dmitry Buzolin edited comment on SPARK-18085 at 12/5/16 2:45 PM:
-----------------------------------------------------------------

I would like to add my observations after working with the Spark History Server (SHS):

1. The JSON format used for event-log storage is inefficient and redundant: a 
large share of the bytes (roughly 70% in my observation) are repeated key names. 
This reliance on JSON is a dead end for a distributed architecture like Spark 
(compression may alleviate this to some extent), and it would be great if it 
were changed to conventional OS-style logging or to storing logs in a database.
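The key-name overhead is easy to see with a small sketch. The event below is only shaped like a Spark listener event; its field names and values are illustrative, not the exact Spark schema:

```python
import json

# Hypothetical event-log line, shaped like a Spark listener event
# (field names are illustrative, not the exact Spark schema).
event = {
    "Event": "SparkListenerTaskEnd",
    "Stage ID": 3,
    "Stage Attempt ID": 0,
    "Task Type": "ResultTask",
    "Task Info": {
        "Task ID": 1042,
        "Index": 17,
        "Attempt": 0,
        "Launch Time": 1480948800000,
        "Executor ID": "4",
        "Host": "worker-07",
        "Locality": "NODE_LOCAL",
        "Finish Time": 1480948801234,
    },
}

line = json.dumps(event)

def key_bytes(obj):
    """Bytes spent on key names (plus their surrounding quotes)."""
    if isinstance(obj, dict):
        return sum(len(k) + 2 + key_bytes(v) for k, v in obj.items())
    if isinstance(obj, list):
        return sum(key_bytes(v) for v in obj)
    return 0

overhead = key_bytes(event) / len(line)
print(f"key-name overhead: {overhead:.0%}")
```

For this made-up line the key names alone are close to half the serialized bytes, before counting JSON punctuation; the exact fraction in real logs depends on the event mix.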

2. The amount of logging Spark produces is directly proportional to the number 
of tasks. I've seen 50+ GB log files sitting in HDFS. The design should be more 
intelligent about producing such logs, since they slow down the UI, hurt the 
performance of the REST API, and can occupy a lot of space in HDFS.
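To make the scale concrete, a back-of-envelope sketch; both constants below are assumptions for illustration, not measured Spark values:

```python
# Rough back-of-envelope: event-log growth is linear in task count.
# Both numbers are illustrative assumptions, not measured Spark values.
BYTES_PER_TASK_EVENTS = 2_000   # assumed TaskStart + TaskEnd JSON, uncompressed
num_tasks = 25_000_000          # e.g. a long job with many small tasks

estimated_log_bytes = num_tasks * BYTES_PER_TASK_EVENTS
print(f"~{estimated_log_bytes / 1e9:.0f} GB of task events alone")  # ~50 GB
```

Under those assumptions a single application reaches the 50 GB range from per-task events alone, which matches the file sizes mentioned above.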

3. The Spark REST API should be consistent with regard to log availability and 
the information it conveys. Two examples:
- Many times, when a Spark application finishes, both YARN and Spark report the 
application as completed via the top-level endpoint, yet the log is not yet 
available through the Spark REST API, which returns a "no such app" message 
when one queries executor or job details. This leaves one guessing and waiting 
before querying the status of the application.
- While a Spark application is running, one can clearly see vCores and 
allocatedMemory for it. Once the application completes, however, these fields 
are reset to -1. Why? Perhaps to indicate that the application is no longer 
running and occupying cluster resources. But there are already flags telling us 
this ("state" and "finalStatus"), so why make it harder to find out how many 
resources were used by applications that have already completed?
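Until that inconsistency is fixed, a client has no choice but to retry. A minimal sketch of that workaround; the base URL and retry parameters are made up, while the /api/v1 path is the Spark monitoring REST endpoint for executors:

```python
import json
import time
import urllib.error
import urllib.request

def fetch_executors(base_url, app_id):
    """One call to the /api/v1 executors endpoint.

    Raises LookupError when the SHS answers 404 ("no such app") even though
    YARN already reports the application as finished.
    """
    url = f"{base_url}/api/v1/applications/{app_id}/executors"
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            raise LookupError(app_id) from err
        raise

def wait_for(call, attempts=10, delay_s=5):
    """Retry `call` until it stops raising LookupError."""
    for i in range(attempts):
        try:
            return call()
        except LookupError:
            if i == attempts - 1:
                raise          # still "no such app" after all attempts
            time.sleep(delay_s)
```

Taking the fetch as a callable (e.g. `wait_for(lambda: fetch_executors(base_url, app_id))`) keeps the retry logic testable without a live History Server.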



> Better History Server scalability for many / large applications
> ---------------------------------------------------------------
>
>                 Key: SPARK-18085
>                 URL: https://issues.apache.org/jira/browse/SPARK-18085
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>    Affects Versions: 2.0.0
>            Reporter: Marcelo Vanzin
>         Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path toward solving them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
