This is an automated email from the ASF dual-hosted git repository.
roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git
The following commit(s) were added to refs/heads/master by this push:
new 3ba56c54 [ISSUE-378][HugePartition][Part-4] Supplement doc about huge
partitions (#505)
3ba56c54 is described below
commit 3ba56c5496f39032cd723906e1776ec129d3cc01
Author: Junfan Zhang <[email protected]>
AuthorDate: Sun Jan 22 21:34:31 2023 +0800
[ISSUE-378][HugePartition][Part-4] Supplement doc about huge partitions
(#505)
### What changes were proposed in this pull request?
Add doc about huge partitions
### Why are the changes needed?
Guide for users
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Don't need
---
docs/server_guide.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 48 insertions(+), 2 deletions(-)
diff --git a/docs/server_guide.md b/docs/server_guide.md
index 09decaae..c646d03c 100644
--- a/docs/server_guide.md
+++ b/docs/server_guide.md
@@ -86,8 +86,6 @@ This document will introduce how to deploy Uniffle shuffle
servers.
|rss.server.leak.shuffledata.check.interval|3600000|The interval of leak
shuffle data check (ms)|
|rss.server.max.concurrency.of.single.partition.writer|1|The max concurrency
of single partition writer, the data partition file number is equal to this
value. Default value is 1. This config could improve the writing speed,
especially for huge partition.|
|rss.metrics.reporter.class|-|The class of metrics reporter.|
-|rss.server.huge-partition.size.threshold|20g|Threshold of huge partition
size, once exceeding threshold, memory usage limitation and huge partition
buffer flushing will be triggered.|
-|rss.server.huge-partition.memory.limit.ratio|0.2|The memory usage limit ratio
for huge partition, it will only triggered when partition's size exceeds the
threshold of 'rss.server.huge-partition.size.threshold'|
### Advanced Configurations
|Property Name|Default| Description
|
@@ -104,3 +102,51 @@ PrometheusPushGatewayMetricReporter is one of the built-in
metrics reporter, whi
|rss.metrics.prometheus.pushgateway.groupingkey|-| Specifies the grouping key
which is the group and global labels of all metrics. The label name and value
are separated by '=', and labels are separated by ';', e.g., k1=v1;k2=v2.
Please ensure that your grouping key meets the [Prometheus
requirements](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
|
|rss.metrics.prometheus.pushgateway.jobname|-| The job name under which
metrics will be pushed.
|
|rss.metrics.prometheus.pushgateway.report.interval.seconds|10| The interval
in seconds for the reporter to report metrics.
|
+
+### Huge Partition Optimization
+A huge partition is a common problem for Spark/MR and so on, caused by data
skew. And it can cause the shuffle server to become unstable. To solve this, we
introduce some mechanisms to limit the writing of huge partitions to avoid
affecting regular partitions, more details can be found in
[ISSUE-378](https://github.com/apache/incubator-uniffle/issues/378). The basic
rules for limiting large partitions are memory usage limits and flushing
individual buffers directly to persistent storage.
+
+#### Memory usage limit
+To do this, we introduce the extra configs
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.server.huge-partition.size.threshold|20g|Threshold of huge partition
size, once exceeding threshold, memory usage limitation and huge partition
buffer flushing will be triggered. This value depends on the capacity of per
disk in shuffle server. For example, per disk capacity is 1TB, and the max size
of huge partition in per disk is 5. So the total size of huge partition in
local disk is 100g (10%),this is an acceptable config value. Once reaching this
threshold, it will be better to [...]
+to HDFS directly, which could be handled by multiple storage manager fallback
strategy|
+|rss.server.huge-partition.memory.limit.ratio|0.2|The memory usage limit ratio
for huge partition, it will only triggered when partition's size exceeds the
threshold of 'rss.server.huge-partition.size.threshold'. If the buffer capacity
is 10g, this means the default memory usage for huge partition is 2g. Samely,
this config value depends on max size of huge partitions on per shuffle server.|
+
+#### Data flush
+Once the huge partition threshold is reached, the partition is marked as a
huge partition. And then single buffer flush is triggered (writing to
persistent storage as soon as possible). By default, single buffer flush is
only enabled by configuring `rss.server.single.buffer.flush.enabled', but it's
automatically valid for huge partition.
+
+If you don't use HDFS, the huge partition may be flushed to local disk, which
is dangerous if the partition size is larger than the free disk space.
Therefore, it is recommended to use a mixed storage type, including HDFS or
other distributed file systems.
+
+For HDFS, the conf value of `rss.server.single.buffer.flush.threshold` should
be greater than the value of `rss.server.flush.cold.storage.threshold.size`,
which will flush data directly to HDFS.
+
+Finally, to improve the speed of writing to HDFS for a single partition, the
value of `rss.server.max.concurrency.of.single.partition.writer` and
`rss.server.flush.threadPool.size` could be increased to 10 or 20.
+
+#### Example of server conf
+```
+rss.rpc.server.port 19999
+rss.jetty.http.port 19998
+rss.rpc.executor.size 2000
+rss.storage.type MEMORY_LOCALFILE_HDFS
+rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
+rss.storage.basePath /data1/rssdata,/data2/rssdata....
+rss.server.flush.thread.alive 10
+rss.server.buffer.capacity 40g
+rss.server.read.buffer.capacity 20g
+rss.server.heartbeat.timeout 60000
+rss.server.heartbeat.interval 10000
+rss.rpc.message.max.size 1073741824
+rss.server.preAllocation.expired 120000
+rss.server.commit.timeout 600000
+rss.server.app.expired.withoutHeartbeat 120000
+
+# For huge partitions
+rss.server.flush.threadPool.size 20
+rss.server.flush.cold.storage.threshold.size 128m
+rss.server.single.buffer.flush.threshold 129m
+rss.server.max.concurrency.of.single.partition.writer 20
+rss.server.huge-partition.size.threshold 20g
+rss.server.huge-partition.memory.limit.ratio 0.2
+```
\ No newline at end of file