[jira] [Updated] (SPARK-18085) Better History Server scalability for many / large applications

2017-06-05 Thread Dmitry Buzolin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Buzolin updated SPARK-18085:
---



> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-14 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749652#comment-15749652
 ] 

Dmitry Buzolin commented on SPARK-18085:


I don't think you read my response carefully enough (I said REST API call 
as well as UI).
The REST API returns JSON, doesn't it? Where is this JSON kept before it is 
delivered to the client making the REST call? Not in thin air; in the memory 
of the SHS. I have nothing to add at this point, but I feel discouraged from 
continuing this discussion with you.

Thanks,
Dmitry.




[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-14 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749592#comment-15749592
 ] 

Dmitry Buzolin commented on SPARK-18085:


I meant that the discussion was becoming unproductive, since you have your 
own definition of orthogonality... Here is mine:

Below is the result of a heap histogram of the SHS process, repeated every 30 
seconds after I clicked a link on the SHS application page.
Do you see how those char[] objects accumulate in the heap until the OOM happens? 
During this time my browser was "hanging" on the HTTP response.

 num     #instances          #bytes  class name
------------------------------------------------
   1:      13075420       578500976  [C
   1:      15799820       653388056  [C
   1:      21342880      1117613800  [C
   1:      23314556      1065313544  [C
   1:      30900112      1380367768  [C
   1:      43923118      1974655888  [C
   1:      45056919      1635108368  [C
   1:      49365245      1867236600  [C
   1:      50455326      1894170920  [C
   1:      53344480      1925798464  [C
   1:      55918048      2013593472  [C
   1:      57219355      2113012528  [C
   1:      61683961      2219073304  [C
   1:      64389451      2312154896  [C

Caused by:
scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
    at org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:204)
    at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:79)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
    at org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)

   1:        737101        83175064  [C
   1:       2631037       463742576  [C
   1:       2305651       408542248  [C

Exactly the same behaviour occurs when I make a REST call: curl ... 
http://shs_node:18088/api/v1/applications/application_1479223266604_3123/executors

So, yes, you do store the JSON (or UI response) in SHS memory. And yes, JSON 
is not an efficient storage format for logs, because about 70% of the data is 
repeated key names.
While I agree that JSON per se is not the root cause of this behaviour (one 
could have the same problem with CSV or any other format),
it quickly magnifies the issue because it repeatedly stores the key names.
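For illustration, the share of a serialized event that goes to key names and 
punctuation rather than values can be measured directly. This is a rough sketch; 
the field names below are modelled on Spark's SparkListenerTaskEnd event-log 
records, but the record itself is made up:

```python
import json

# Hypothetical event-log record; field names modelled on Spark's
# SparkListenerTaskEnd events, values invented for illustration.
event = {
    "Event": "SparkListenerTaskEnd",
    "Stage ID": 1,
    "Task Info": {"Task ID": 42, "Launch Time": 1480000000000,
                  "Executor ID": "7", "Host": "node-01", "Successful": True},
    "Task Metrics": {"Executor Run Time": 1234,
                     "JVM GC Time": 56,
                     "Result Size": 2048},
}

def key_overhead(obj):
    """Fraction of the serialized size spent on key names and JSON
    punctuation rather than on the leaf values themselves."""
    total = len(json.dumps(obj))

    def value_bytes(o):
        # Sum the serialized size of leaf values only.
        if isinstance(o, dict):
            return sum(value_bytes(v) for v in o.values())
        if isinstance(o, list):
            return sum(value_bytes(v) for v in o)
        return len(json.dumps(o))

    return 1 - value_bytes(obj) / total

print(f"non-value overhead: {key_overhead(event):.0%}")
```

For records dominated by small numeric metrics, well over half of each line is 
structural overhead, which is consistent with the rough 70% figure above.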
In addition to my suggestions above, I would propose configurable logging 
levels: for example, if we don't want to log task details, we should have 
an option to turn that off.
Also, when a request is made to get executor details, it shouldn't hang the 
way it does now; rather, it should return a response indicating that the 
job/task aggregation information is not available.
I can only guess at the root cause, but I have noticed this always happens 
when there are thousands to hundreds of thousands of tasks spawned during 
application execution.
So we do need a more intelligent, more configurable SHS logging facility that 
doesn't consume too many cluster resources to perform aggregation.
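As a side note, char[] growth like the histograms above can be tracked 
programmatically instead of eyeballed. A small sketch, assuming `jmap -histo` 
style output (the snapshot lines here are copied from the dumps above):

```python
import re

# Matches 'jmap -histo' lines such as:
#    1:  13075420  578500976  [C
HISTO_LINE = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")

def char_array_bytes(histo_text):
    """Extract the retained byte counts of char[] ('[C') entries from a
    series of concatenated jmap -histo snapshots."""
    sizes = []
    for line in histo_text.splitlines():
        m = HISTO_LINE.match(line)
        if m and m.group(3) == "[C":
            sizes.append(int(m.group(2)))
    return sizes

snapshots = """\
   1:  13075420  578500976  [C
   1:  15799820  653388056  [C
   1:  21342880 1117613800  [C
"""
print(char_array_bytes(snapshots))  # char[] bytes per 30-second snapshot
```

Plotting or diffing the returned list makes the monotonic growth before the 
OOM easy to demonstrate.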

Good luck building better SHS!




[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-06 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727446#comment-15727446
 ] 

Dmitry Buzolin commented on SPARK-18085:


I posted my comments not to start an endless flame war about what is 
orthogonal and what is not.
It is up to you how to use them. I speak from my experience running Spark 
clusters of substantial size.
If you think offloading the problem from memory to disk storage is the way to 
go, do it. I'd be happy to see SHS performance improvements in the next Spark 
release.





[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-06 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725880#comment-15725880
 ] 

Dmitry Buzolin edited comment on SPARK-18085 at 12/6/16 3:58 PM:
-

The Spark log size depends directly on a few things:

- the underlying schema-less data format being used (JSON)
- the current logging implementation, where the log size is directly 
proportional to the number of tasks

Since the SHS keeps this data in memory, I don't see how these issues are 
orthogonal to the memory issues in the SHS; in my opinion, they are causing 
them. JSON is great as a data interchange or configuration format and is good 
for small payloads, but using it for logging? This is the first time I've seen 
that. I understand you may not change this, but it's worth keeping in mind.

Thank you.









[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-05 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722425#comment-15722425
 ] 

Dmitry Buzolin edited comment on SPARK-18085 at 12/5/16 2:45 PM:
-

I would like to add my observations from working with the SHS:

1. The JSON format for log storage is inefficient and redundant: about 70% of 
the information in the logs is repeated key names. This reliance on JSON is a 
dead end for a distributed architecture like Spark (though compression may 
alleviate it to some extent), and it would be great if this changed to normal 
OS-style logging or to storing logs in a database.

2. The amount of logging in Spark is directly proportional to the number of 
tasks. I've seen 50+ GB log files sitting in HDFS. The design has to be more 
intelligent than to produce such logs, as they slow down the UI, hurt the 
performance of the REST API, and can occupy a lot of space in HDFS.

3. The Spark REST API should be consistent with regard to log availability and 
the information it conveys. Just two examples:
- Often, when a Spark application finishes and both YARN and Spark report the 
application as completed via calls to the top-level endpoint, the log file is 
still not available via the Spark REST API, which returns a "no such app" 
message when one queries executor or job details. This leaves one guessing and 
waiting before querying the status of the application.
- While a Spark app is running, one can clearly see vCores and allocatedMemory 
for it. However, once the application completes, these parameters are reset to 
-1. Why? Perhaps to indicate that the application is no longer running and 
occupying cluster resources. But there are already flags telling us this, 
"state" and "finalStatus", so why make it harder to find out how many 
resources were used by apps that have already completed?
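To cope with the first inconsistency above, a client can poll until the 
History Server has finished replaying the event log, rather than guess. A 
minimal sketch, assuming the SHS answers HTTP 404 ("no such app") for an 
application it has not loaded yet; the host name here is a placeholder, and 
the app id is the one from the earlier curl example:

```python
import json
import time
from urllib.error import HTTPError
from urllib.request import urlopen

BASE = "http://shs_node:18088/api/v1"  # placeholder SHS address

def fetch_executors(app_id):
    """Fetch the executor list for one application from the SHS REST API."""
    with urlopen(f"{BASE}/applications/{app_id}/executors") as resp:
        return json.load(resp)

def poll_until_available(fetch, retries=10, delay=5.0):
    """Retry a fetch that may 404 while the History Server is still
    replaying the event log; give up after `retries` attempts."""
    last = None
    for _ in range(retries):
        try:
            return fetch()
        except HTTPError as err:
            if err.code != 404:   # only "no such app" is worth retrying
                raise
            last = err
            time.sleep(delay)
    raise TimeoutError("application never became available in the SHS") from last

# executors = poll_until_available(
#     lambda: fetch_executors("application_1479223266604_3123"))
```

The same pattern applies to the jobs and stages endpoints; it papers over the 
inconsistency but does not fix it, which is why a definitive "not yet 
available" response from the server would be better.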





