[DISCUSSION] Incubating Proposal of Firestorm

Jerry Shao Mon, 16 May 2022 06:44:31 -0700

Hi all,

We would like to propose Firestorm [1] as a new Apache Incubator project.
You can find the proposal here [2] for more details.

Firestorm is a high performance, general purpose Remote Shuffle Service for
distributed compute engines like Apache Spark
<https://spark.apache.org/>, Apache
Hadoop MapReduce <https://hadoop.apache.org/>, Apache Flink
<https://flink.apache.org/> and so on. We are aiming to make Firestorm a
universal shuffle service for distributed compute engines.

Shuffle is the key mechanism by which a distributed compute engine
exchanges data between distributed tasks, so the performance and stability
of shuffle directly affect the whole job. The current “local file pull-like
shuffle style” has several limitations:

1. The current shuffle struggles to support very large workloads,
especially in a high-load environment. The major problem is IO (random
disk IO, network congestion and timeouts).
2. The current shuffle is hard to deploy in a disaggregated compute and
storage environment, as disk capacity is quite limited on compute nodes.
3. The constraint of storing shuffle data locally makes it hard to scale
elastically.
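
To make the random IO concern in item 1 concrete, here is a toy sketch in
Python. It is not Firestorm code: the function names, the byte-sum
partitioner, and the in-memory lists standing in for disk files are all
assumptions made for illustration. It contrasts a local, pull-style layout,
where each reducer has to fetch one block from every map task, with a layout
where a remote service has already grouped records by partition.

```python
from collections import defaultdict


def partition_of(key, num_partitions):
    # Hypothetical deterministic partitioner for this sketch only.
    return sum(key.encode()) % num_partitions


def local_pull_blocks(map_outputs, num_partitions):
    """Local-file style: every map task keeps one block per partition.

    The reducer for partition p must fetch each (map_id, p) block,
    so the block count can grow up to maps x partitions.
    """
    blocks = {}
    for map_id, records in enumerate(map_outputs):
        for key, value in records:
            p = partition_of(key, num_partitions)
            blocks.setdefault((map_id, p), []).append((key, value))
    return blocks


def remote_partitions(map_outputs, num_partitions):
    """Remote-service style (assumed): map tasks push records to a service
    that groups them by partition, so a reducer reads one stream."""
    partitions = defaultdict(list)
    for records in map_outputs:
        for key, value in records:
            partitions[partition_of(key, num_partitions)].append((key, value))
    return dict(partitions)
```

With M map tasks and R partitions, the first layout can leave up to M x R
small blocks for reducers to pull, while the second leaves at most R
partition streams holding the same records.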

Remote Shuffle Service is a key technology for enterprises to build big
data platforms, to expand big data applications to disaggregated and
online-offline hybrid environments, and to solve the problems above.

The implementation of Remote Shuffle Service, “Firestorm”, is widely
adopted at Tencent and has shown its advantages in production. Other
enterprises have also adopted, or are preparing to adopt, Firestorm in
their environments.

Firestorm’s key idea is borrowed from Sailfish
<https://www.researchgate.net/publication/262241541_Sailfish_a_framework_for_large_scale_data_processing>.
It has several key design goals:

1. High performance. For small workloads, Firestorm’s performance is close
to that of local file based shuffle. For large workloads, it is far better
than the current shuffle style.
2. Fault tolerance. Firestorm provides high availability for Coordinator
nodes, and failover for Shuffle nodes.
3. Pluggable. Firestorm is highly pluggable and can be adapted to
different compute engines, different backend storages, and different
wire protocols.

We believe that the Firestorm project will provide great value to the
community if it is accepted by the Apache Incubator.

I will help this project as champion, and many thanks to the three mentors:

- Junping Du (junping...@apache.org)
- Xun Liu (liu...@apache.org)
- Zhankun Tang (zt...@apache.org)

[1] https://github.com/Tencent/Firestorm
[2] https://cwiki.apache.org/confluence/display/INCUBATOR/FirestormProposal

Best regards,
Jerry
