[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

2017-04-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971594#comment-15971594
 ] 

Marcelo Vanzin edited comment on SPARK-18085 at 4/17/17 8:57 PM:
-

I'm getting close to a point where I think the code can start to trickle in. I 
want to wait until 2.2's branch gets going before sending PRs, though. In the 
meantime, I'm keeping "private PRs" in my fork for each milestone, so it's easy 
for anybody interested in familiarizing themselves with the code to provide 
comments:

https://github.com/vanzin/spark/pulls

At this point, all the UI data that the SHS shows is kept in a disk store 
(that covers core + SQL, but not streaming). Since streaming is not shown in 
the SHS, I'm not planning to touch it (aside from the small changes I made that 
were required by internal API changes in core).

What's left, from my view:
- managing disk space in the SHS so that a large number of apps doesn't cause 
the SHS to fill local disks
- limiting the number of jobs / stages / tasks / etc. kept in the store (similar 
to existing settings, which the code doesn't yet honor)
- an in-memory implementation of the store (in case someone wants lower latency 
or can't / doesn't want to use the disk store)
- more tests, and more testing
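As a rough illustration of the job/stage retention limits in the list above, here is a minimal Python sketch of a bounded store that evicts its oldest entries when a cap is exceeded. The `BoundedStore` class and its API are hypothetical, purely for illustration; they are not Spark's actual store implementation.

```python
from collections import OrderedDict

class BoundedStore:
    """Toy key/value store that keeps at most max_entries items,
    evicting the oldest first (analogous in spirit to the
    retained-jobs / retained-stages style limits mentioned above)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)  # treat re-puts as most recent
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def get(self, key):
        return self._data.get(key)

store = BoundedStore(2)
for job_id in ("job-1", "job-2", "job-3"):
    store.put(job_id, {"status": "done"})
# job-1 has been evicted; only the two most recent jobs remain
```

A real store would also have to cascade the eviction (dropping a stage drops its tasks), but the capping mechanism is the same idea.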





> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749660#comment-15749660
 ] 

Marcelo Vanzin edited comment on SPARK-18085 at 12/14/16 10:14 PM:
---

Yes. The REST API returns JSON. That JSON is not read from the event log. The 
Spark UI also returns large HTML files generated in memory.

Yes. There might be enhancements that can be made to that part of the code. 
But, for the 100th time, that is not what this work is about. If you care about 
those, *open a new bug*. It's not hard.









[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-06 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15725880#comment-15725880
 ] 

Dmitry Buzolin edited comment on SPARK-18085 at 12/6/16 3:58 PM:
-

Spark log size depends directly on a few things:

- the underlying schema-less data format being used - JSON
- the current logging implementation, where the log size is directly dependent 
on the number of tasks

Since the SHS keeps this data in memory, I don't see how these issues are 
orthogonal to the memory issues in the SHS; in my opinion they are causing 
them. JSON is great as a data interchange or configuration format, and it's 
good for small payloads, but using it for logging? That's the first time I've 
seen it. I understand you may not change this, but it's worth keeping in mind.
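The repeated-key overhead is easy to measure on a synthetic event. The field names below are made up for illustration only; they are not the real Spark event-log schema.

```python
import json

# A synthetic event vaguely shaped like a task-end log line; the field
# names are illustrative, not Spark's actual schema.
event = {
    "Event": "TaskEnd",
    "Stage ID": 3,
    "Task Info": {"Task ID": 42, "Launch Time": 1480000000000,
                  "Executor ID": "1", "Host": "worker-1"},
}

def key_bytes(obj):
    """Bytes spent on key names (including their quotes) in a JSON value."""
    if isinstance(obj, dict):
        return sum(len(k) + 2 + key_bytes(v) for k, v in obj.items())
    if isinstance(obj, list):
        return sum(key_bytes(v) for v in obj)
    return 0

total = len(json.dumps(event, separators=(",", ":")))
overhead = key_bytes(event) / total
# For this event, key names account for more than half of the bytes.
```

The exact fraction depends on the event shape, but for small, deeply keyed events the structural overhead dominates the payload.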

Thank you.









[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-05 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722425#comment-15722425
 ] 

Dmitry Buzolin edited comment on SPARK-18085 at 12/5/16 2:45 PM:
-

I would like to add my observations after working with the SHS:

1. The JSON format for log storage is inefficient and redundant - about 70% of 
the information in the logs is repeated key names. This reliance on JSON is a 
dead end (perhaps compression may alleviate this to some extent) for a 
distributed architecture like Spark, and it would be great if this changed to 
normal OS-style logging or to storing logs in a database.

2. The amount of logging in Spark is directly proportional to the number of 
tasks. I've seen 50+ GB log files sitting in HDFS. The design has to be more 
intelligent than to produce such logs, as they slow down the UI, impact the 
performance of the REST API, and can occupy a lot of space in HDFS.

3. The Spark REST API should be consistent with regard to log availability and 
the information it conveys. Just two examples:
- Many times, when a Spark application finishes and both Yarn and Spark report 
the application as completed via calls to the top-level endpoint, the log file 
is still not available via the Spark REST API, which returns a "no such app" 
message when one queries executor or job details. This leaves one guessing and 
waiting before querying the status of the application.
- When a Spark app is running, one can clearly see vCores and allocatedMemory 
for the running application. However, once the application completes, these 
parameters are reset to -1. Why? Perhaps to indicate that the application is no 
longer running and occupying cluster resources. But there are already flags 
telling us this ("state" and "finalStatus"), so why make it more difficult to 
find out how many resources were used by apps that have already completed?
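The compression aside in point 1 can be sanity-checked with a small sketch using synthetic events (the field names are illustrative, not Spark's event-log schema): because the key names repeat on every line, even plain zlib removes most of the redundancy.

```python
import json
import zlib

# 1,000 near-identical events: the key names repeat on every line, so
# the raw text is dominated by redundant structure.
lines = "\n".join(
    json.dumps({"Event": "TaskEnd", "Task ID": i, "Host": "worker-1"})
    for i in range(1000)
)
raw = lines.encode("utf-8")
compressed = zlib.compress(raw)
ratio = len(compressed) / len(raw)
# The highly repetitive log compresses to a small fraction of its size.
```

Compression helps the storage cost, though not the in-memory cost of parsing the events back, which is the part this work targets.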





