This is an automated email from the ASF dual-hosted git repository.

jshao pushed a commit to branch branch-0.5.0
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git

commit af92c1ca1339d3353ba3f80d5d97ee0658977397
Author: Colin <coli...@tencent.com>
AuthorDate: Wed Jun 22 17:16:53 2022 +0800

    [Doc] Update readme with features like multiple remote storage support etc (#192)
    
    ### What changes were proposed in this pull request?
    Update Readme for the latest features, e.g., multiple remote storage support, dynamic client conf, etc.
    
    ### Why are the changes needed?
    The doc should be updated to reflect the latest features.
    
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    No need
---
 README.md | 46 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index e134f0f..50903ce 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Coordinator will collect status of shuffle server and do the assignment for the
 
 Shuffle server will receive the shuffle data, merge them and write to storage.
 
-Depend on different situation, Firestorm supports Memory & Local, Memory & Remote Storage(eg, HDFS), Local only, Remote Storage only.
+Depending on the situation, Firestorm supports Memory & Local, Memory & Remote Storage (e.g., HDFS), and Memory & Local & Remote Storage (recommended for production environments).
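+
+As a sketch (using only the two values that appear in the demos below), these combinations map to the rss.storage.type setting:
+
+```
+# Memory & Local
+rss.storage.type MEMORY_LOCALFILE
+# Memory & Local & Remote Storage, recommended for production
+rss.storage.type MEMORY_LOCALFILE_HDFS
+```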
 
 ## Shuffle Process with Firestorm
 
@@ -74,9 +74,25 @@ rss-xxx.tgz will be generated for deployment
      rss.coordinator.server.heartbeat.timeout 30000
      rss.coordinator.app.expired 60000
      rss.coordinator.shuffle.nodes.max 5
-     rss.coordinator.exclude.nodes.file.path RSS_HOME/conf/exclude_nodes
-   ```
-4. start Coordinator
+     # enable dynamicClientConf; the coordinator will then supply most of the client conf
+     rss.coordinator.dynamicClientConf.enabled true
+     # config the path of client conf
+     rss.coordinator.dynamicClientConf.path <RSS_HOME>/conf/dynamic_client.conf
+     # config the path of the excluded shuffle servers file
+     rss.coordinator.exclude.nodes.file.path <RSS_HOME>/conf/exclude_nodes
+   ```
+4. update <RSS_HOME>/conf/dynamic_client.conf; the RSS client will get default conf from the coordinator, e.g.,
+   ```
+    # MEMORY_LOCALFILE_HDFS is recommended for production environments
+    rss.storage.type MEMORY_LOCALFILE_HDFS
+    # multiple remote storages are supported; the client will get its assignment from the coordinator
+    rss.coordinator.remote.storage.path hdfs://cluster1/path,hdfs://cluster2/path
+    rss.writer.require.memory.retryMax 1200
+    rss.client.retry.max 100
+    rss.writer.send.check.timeout 600000
+    rss.client.read.buffer.size 14m
+   ```
+5. start Coordinator
    ```
    bash RSS_HOME/bin/start-coordinator.sh
    ```
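+   As a quick liveness check (an ad-hoc sketch, not an official tool), probe the coordinator RPC port from the quorum used below, e.g.,
+   ```
+    # exits 0 once the coordinator is listening on its RPC port (19999 in this demo)
+    nc -z <coordinatorIp1> 19999 && echo "coordinator is up"
+   ```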
@@ -90,14 +106,17 @@ rss-xxx.tgz will be generated for deployment
      HADOOP_HOME=<hadoop home>
      XMX_SIZE="80g"
    ```
-3. update RSS_HOME/conf/server.conf, the following demo is for memory + local storage only, eg,
+3. update RSS_HOME/conf/server.conf, e.g.,
    ```
      rss.rpc.server.port 19999
      rss.jetty.http.port 19998
      rss.rpc.executor.size 2000
-     rss.storage.type MEMORY_LOCALFILE
+     # it should be configured the same as in the coordinator
+     rss.storage.type MEMORY_LOCALFILE_HDFS
      rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
+     # local storage path for shuffle server
      rss.storage.basePath /data1/rssdata,/data2/rssdata....
+     # it's better to configure the thread num according to the local disk num
      rss.server.flush.thread.alive 5
      rss.server.flush.threadPool.size 10
      rss.server.buffer.capacity 40g
@@ -108,6 +127,10 @@ rss-xxx.tgz will be generated for deployment
      rss.server.preAllocation.expired 120000
      rss.server.commit.timeout 600000
      rss.server.app.expired.withoutHeartbeat 120000
+     # note: the default value of rss.server.flush.cold.storage.threshold.size is 64m
+     # no data will be written to DFS if this is set to 100g, even with rss.storage.type=MEMORY_LOCALFILE_HDFS
+     # please set a proper value if DFS is used, e.g., 64m, 128m (see the sizing sketch below)
+     rss.server.flush.cold.storage.threshold.size 100g
    ```
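+   As a sizing sketch for the threshold (an illustration of the note above, not an official recommendation): a single flush event larger than the threshold is written to remote storage (HDFS), a smaller one to local disk, e.g.,
+   ```
+    # with 128m, a 200m flush event goes to HDFS, a 60m event stays on local disk
+    rss.server.flush.cold.storage.threshold.size 128m
+   ```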
 4. start Shuffle Server
    ```
@@ -121,12 +144,11 @@ rss-xxx.tgz will be generated for deployment
 
   The jar for Spark3 is located in <RSS_HOME>/jars/client/spark3/rss-client-XXXXX-shaded.jar
 
-2. Update Spark conf to enable Firestorm, the following demo is for local storage only, eg,
+2. Update Spark conf to enable Firestorm, e.g.,
 
    ```
    spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
    spark.rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
-   spark.rss.storage.type MEMORY_LOCALFILE
    ```
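+
+   A hypothetical spark-submit invocation with the same conf inlined (assuming the shaded client jar from step 1 is on the driver and executor classpath):
+   ```
+   spark-submit \
+     --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager \
+     --conf spark.rss.coordinator.quorum=<coordinatorIp1>:19999,<coordinatorIp2>:19999 \
+     --class <main class> <application jar>
+   ```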
 
 ### Support Spark dynamic allocation
@@ -140,17 +162,16 @@ After apply the patch and rebuild spark, add following configuration in spark co
   spark.dynamicAllocation.enabled true
   ```
 
-## Deploy MapReduce Client
+### Deploy MapReduce Client
 
 1. Add client jar to the classpath of each NodeManager, e.g., <HADOOP>/share/hadoop/mapreduce/
 
 The jar for MapReduce is located in <RSS_HOME>/jars/client/mr/rss-client-mr-XXXXX-shaded.jar
 
-2. Update MapReduce conf to enable Firestorm, the following demo is for local storage only, eg,
+2. Update MapReduce conf to enable Firestorm, e.g.,
 
   ```
    -Dmapreduce.rss.coordinator.quorum=<coordinatorIp1>:19999,<coordinatorIp2>:19999
-   -Dmapreduce.rss.storage.type=MEMORY_LOCALFILE
    -Dyarn.app.mapreduce.am.command-opts=org.apache.hadoop.mapreduce.v2.app.RssMRAppMaster
    -Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.RssMapOutputCollector
    -Dmapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.RssShuffle
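+
+   A hypothetical submission with these options inlined (assuming the stock Hadoop examples jar; any ToolRunner-based job accepts -D options the same way):
+   ```
+   hadoop jar hadoop-mapreduce-examples.jar wordcount \
+     -Dmapreduce.rss.coordinator.quorum=<coordinatorIp1>:19999,<coordinatorIp2>:19999 \
+     -Dyarn.app.mapreduce.am.command-opts=org.apache.hadoop.mapreduce.v2.app.RssMRAppMaster \
+     -Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.RssMapOutputCollector \
+     -Dmapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.RssShuffle \
+     <input path> <output path>
+   ```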
@@ -168,9 +189,10 @@ The important configuration is listed as following.
 |Property Name|Default|        Description|
 |---|---|---|
 |rss.coordinator.server.heartbeat.timeout|30000|Timeout if can't get heartbeat from shuffle server|
-|rss.coordinator.assignment.strategy|BASIC|Strategy for assigning shuffle server, only BASIC support|
+|rss.coordinator.assignment.strategy|PARTITION_BALANCE|Strategy for assigning shuffle server; PARTITION_BALANCE should be used for workload balance|
 |rss.coordinator.app.expired|60000|Application expired time (ms), the heartbeat interval should be less than it|
 |rss.coordinator.shuffle.nodes.max|9|The max number of shuffle server when do the assignment|
+|rss.coordinator.dynamicClientConf.path|-|The path of the configuration file which has default conf for the rss client|
 |rss.coordinator.exclude.nodes.file.path|-|The path of configuration file which have exclude nodes|
 |rss.coordinator.exclude.nodes.check.interval.ms|60000|Update interval (ms) for exclude nodes|
 |rss.rpc.server.port|-|RPC port for coordinator|
