This is an automated email from the ASF dual-hosted git repository.

roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git


The following commit(s) were added to refs/heads/master by this push:
     new 3ba56c54 [ISSUE-378][HugePartition][Part-4] Supplement doc about huge 
partitions (#505)
3ba56c54 is described below

commit 3ba56c5496f39032cd723906e1776ec129d3cc01
Author: Junfan Zhang <[email protected]>
AuthorDate: Sun Jan 22 21:34:31 2023 +0800

    [ISSUE-378][HugePartition][Part-4] Supplement doc about huge partitions 
(#505)
    
    ### What changes were proposed in this pull request?
    Add doc about huge partitions
    
    ### Why are the changes needed?
    Guide for users
    
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Don't need
---
 docs/server_guide.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 2 deletions(-)

diff --git a/docs/server_guide.md b/docs/server_guide.md
index 09decaae..c646d03c 100644
--- a/docs/server_guide.md
+++ b/docs/server_guide.md
@@ -86,8 +86,6 @@ This document will introduce how to deploy Uniffle shuffle 
servers.
 |rss.server.leak.shuffledata.check.interval|3600000|The interval of leak 
shuffle data check (ms)|
 |rss.server.max.concurrency.of.single.partition.writer|1|The max concurrency 
of single partition writer, the data partition file number is equal to this 
value. Default value is 1. This config could improve the writing speed, 
especially for huge partition.|
 |rss.metrics.reporter.class|-|The class of metrics reporter.|
-|rss.server.huge-partition.size.threshold|20g|Threshold of huge partition 
size, once exceeding threshold, memory usage limitation and huge partition 
buffer flushing will be triggered.|
-|rss.server.huge-partition.memory.limit.ratio|0.2|The memory usage limit ratio 
for huge partition, it will only triggered when partition's size exceeds the 
threshold of 'rss.server.huge-partition.size.threshold'|
 
 ### Advanced Configurations
 |Property Name|Default| Description                                            
                                                                                
                                                     |
@@ -104,3 +102,51 @@ PrometheusPushGatewayMetricReporter is one of the built-in 
metrics reporter, whi
 |rss.metrics.prometheus.pushgateway.groupingkey|-| Specifies the grouping key 
which is the group and global labels of all metrics. The label name and value 
are separated by '=', and labels are separated by ';', e.g., k1=v1;k2=v2. 
Please ensure that your grouping key meets the [Prometheus 
requirements](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
 |
 |rss.metrics.prometheus.pushgateway.jobname|-| The job name under which 
metrics will be pushed.                                                         
                                                                                
                                                                                
                                                             |
 |rss.metrics.prometheus.pushgateway.report.interval.seconds|10| The interval 
in seconds for the reporter to report metrics.                                  
                                                                                
                                                                                
                                                                                
   |
+
+### Huge Partition Optimization
+A huge partition is a common problem for Spark/MR and so on, caused by data 
skew. And it can cause the shuffle server to become unstable. To solve this, we 
introduce some mechanisms to limit the writing of huge partitions to avoid 
affecting regular partitions, more details can be found in 
[ISSUE-378](https://github.com/apache/incubator-uniffle/issues/378). The basic 
rules for limiting large partitions are memory usage limits and flushing 
individual buffers directly to persistent storage.
+
+#### Memory usage limit
+To do this, we introduce the extra configs
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.server.huge-partition.size.threshold|20g|Threshold of huge partition 
size, once exceeding threshold, memory usage limitation and huge partition 
buffer flushing will be triggered. This value depends on the capacity of per 
disk in shuffle server. For example, per disk capacity is 1TB, and the max size 
of huge partition in per disk is 5. So the total size of huge partition in 
local disk is 100g (10%),this is an acceptable config value. Once reaching this 
threshold, it will be better to [...]
+to HDFS directly, which could be handled by multiple storage manager fallback 
strategy|
+|rss.server.huge-partition.memory.limit.ratio|0.2|The memory usage limit ratio 
for huge partition, it will only triggered when partition's size exceeds the 
threshold of 'rss.server.huge-partition.size.threshold'. If the buffer capacity 
is 10g, this means the default memory usage for huge partition is 2g. Samely, 
this config value depends on max size of huge partitions on per shuffle server.|
+
+#### Data flush
+Once the huge partition threshold is reached, the partition is marked as a 
huge partition. And then single buffer flush is triggered (writing to 
persistent storage as soon as possible). By default, single buffer flush is 
only enabled by configuring `rss.server.single.buffer.flush.enabled', but it's 
automatically valid for huge partition. 
+
+If you don't use HDFS, the huge partition may be flushed to local disk, which 
is dangerous if the partition size is larger than the free disk space. 
Therefore, it is recommended to use a mixed storage type, including HDFS or 
other distributed file systems.
+
+For HDFS, the conf value of `rss.server.single.buffer.flush.threshold` should 
be greater than the value of `rss.server.flush.cold.storage.threshold.size`, 
which will flush data directly to HDFS. 
+
+Finally, to improve the speed of writing to HDFS for a single partition, the 
value of `rss.server.max.concurrency.of.single.partition.writer` and 
`rss.server.flush.threadPool.size` could be increased to 10 or 20.
+
+#### Example of server conf
+```
+rss.rpc.server.port 19999
+rss.jetty.http.port 19998
+rss.rpc.executor.size 2000
+rss.storage.type MEMORY_LOCALFILE_HDFS
+rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
+rss.storage.basePath /data1/rssdata,/data2/rssdata....
+rss.server.flush.thread.alive 10
+rss.server.buffer.capacity 40g
+rss.server.read.buffer.capacity 20g
+rss.server.heartbeat.timeout 60000
+rss.server.heartbeat.interval 10000
+rss.rpc.message.max.size 1073741824
+rss.server.preAllocation.expired 120000
+rss.server.commit.timeout 600000
+rss.server.app.expired.withoutHeartbeat 120000
+
+# For huge partitions
+rss.server.flush.threadPool.size 20
+rss.server.flush.cold.storage.threshold.size 128m
+rss.server.single.buffer.flush.threshold 129m
+rss.server.max.concurrency.of.single.partition.writer 20
+rss.server.huge-partition.size.threshold 20g
+rss.server.huge-partition.memory.limit.ratio 0.2
+```
\ No newline at end of file

Reply via email to