This is an automated email from the ASF dual-hosted git repository.

zhifgli pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 9a50ac0  [878019401] add config build tuning docs (#15)
9a50ac0 is described below

commit 9a50ac05dda2142e2c5328ebfaff8246054b4383
Author: macduan <[email protected]>
AuthorDate: Wed Oct 26 15:38:36 2022 +0800

    [878019401] add config build tuning docs (#15)
    
    Co-authored-by: macduan <[email protected]>
---
 docs/06-configuration.md | 185 +++++++++++++++++++++++++++++++++++++++++++++++
 docs/07-build.md         |  43 +++++++++++
 docs/08-tuning.md        |  21 ++++++
 3 files changed, 249 insertions(+)

diff --git a/docs/06-configuration.md b/docs/06-configuration.md
new file mode 100644
index 0000000..08fa767
--- /dev/null
+++ b/docs/06-configuration.md
@@ -0,0 +1,185 @@
+# Uniffle Configuration
+
+## Common Settings
+
+>Common settings are used by both coordinators and shuffle servers.
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.coordinator.quorum|-|Coordinator quorum list|
+|rss.rpc.server.type|GRPC|rpc service type|
+|rss.rpc.server.port|-|rpc service port for the coordinator and the shuffle server|
+|rss.rpc.message.max.size|1073741824|Max size of rpc message|
+|rss.rpc.metrics.enabled|true|enable rpc related metrics|
+|rss.jetty.http.port|-|metrics service port|
+|rss.jetty.corePool.size|256|metrics service jetty corePool size|
+|rss.jetty.maxPool.size|256|metrics service jetty max pool size|
+|rss.jetty.stop.timeout|30000|jetty stop timeout (ms)|
+|rss.jetty.http.idle.timeout|30000|jetty http idle timeout (ms)|
+|rss.rpc.client.type|GRPC|client rpc type|
+|rss.storage.type|-|Supports MEMORY_LOCALFILE, MEMORY_HDFS, MEMORY_LOCALFILE_HDFS|
+|rss.storage.data.replica|1|replica num of storage|
+|rss.rpc.executor.size|1000|thread number for grpc to process requests|
+|rss.jvm.metrics.verbose.enable|true|enable jvm metrics verbose|
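+
+As an illustration, the common settings might look like this in both the coordinator's and the shuffle server's conf file, written in the key-value form the conf files use (host names and values below are placeholders, not recommendations):
+
+    rss.coordinator.quorum <coordinator-host-1>:19999,<coordinator-host-2>:19999
+    rss.rpc.server.port 19999
+    rss.jetty.http.port 19998
+    rss.storage.type MEMORY_LOCALFILE_HDFS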
+
+## Coordinator
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.coordinator.exclude.nodes.file.path|-|The path of the configuration file which contains excluded nodes|
+|rss.coordinator.exclude.nodes.check.interval.ms|60000|Update interval (ms) for excluded nodes|
+|rss.coordinator.server.heartbeat.timeout|30000|Timeout if no heartbeat can be got from a shuffle server|
+|rss.coordinator.server.periodic.output.interval.times|30|The periodic interval times of outputting alive nodes. The interval in seconds can be calculated by (rss.coordinator.server.heartbeat.timeout/3 * rss.coordinator.server.periodic.output.interval.times). The default output interval is 5 min|
+|rss.coordinator.assignment.strategy|PARTITION_BALANCE|Strategy for assigning shuffle servers; PARTITION_BALANCE should be used for workload balance|
+|rss.coordinator.app.expired|60000|Application expired time (ms); the heartbeat interval should be less than it|
+|rss.coordinator.shuffle.nodes.max|9|The max number of shuffle servers used when doing an assignment|
+|rss.coordinator.access.checkers|org.apache.uniffle.coordinator.AccessClusterLoadChecker|The access checkers used when the spark client uses the DelegationShuffleManager; they decide whether to use rss according to the result of the specified access checkers|
+|rss.coordinator.access.loadChecker.serverNum.threshold|-|The minimal required number of healthy shuffle servers when being accessed by a client. When not specified, the required shuffle-server number from the client is used as the checking condition. If no client shuffle-server number is specified, the coordinator conf rss.coordinator.shuffle.nodes.max is adopted|
+|rss.coordinator.access.candidates.updateIntervalSec|120|Access candidates update interval in seconds, only valid when AccessCandidatesChecker is enabled|
+|rss.coordinator.access.candidates.path|-|Access candidates file path; the file can be stored on HDFS|
+|rss.coordinator.access.loadChecker.memory.percentage|15.0|The minimal percentage of available memory of a server|
+|rss.coordinator.dynamicClientConf.enabled|false|Whether to enable dynamic client conf, which will be fetched by the spark client|
+|rss.coordinator.dynamicClientConf.path|-|The path of the dynamic client conf file, which holds default conf for the rss client; it can be stored in HDFS or local storage|
+|rss.coordinator.dynamicClientConf.updateIntervalSec|120|The dynamic client conf update interval in seconds|
+|rss.coordinator.remote.storage.cluster.conf|-|Remote storage cluster related conf in the format $clusterId,$key=$value, separated by ';'|
+|rss.coordinator.remote.storage.select.strategy|APP_BALANCE|Strategy for selecting the remote path|
+|rss.coordinator.remote.storage.schedule.time|60000|The interval of scheduling read and write checks against the paths to evaluate different HDFS storages|
+|rss.coordinator.remote.storage.schedule.file.size|204800000|The size of the file that the scheduled thread reads and writes|
+|rss.coordinator.remote.storage.schedule.access.times|3|The number of times to read and write HDFS files|
+|rss.coordinator.assignment.host.strategy|-|Strategy for selecting shuffle servers|
+|rss.coordinator.startup-silent-period.enabled|false|Enable the startup-silent-period to reject assignment requests, avoiding partial assignments. To avoid service interruption, this mechanism is disabled by default. It is especially recommended in coordinator HA mode when restarting a single coordinator|
+|rss.coordinator.startup-silent-period.duration|20000|The waiting duration (ms) when rss.coordinator.startup-silent-period.enabled is enabled|
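+
+As a sketch, a few of these settings combined in a coordinator.conf (the exclude-nodes path is a placeholder):
+
+    rss.coordinator.server.heartbeat.timeout 30000
+    rss.coordinator.app.expired 60000
+    rss.coordinator.shuffle.nodes.max 9
+    rss.coordinator.exclude.nodes.file.path /path/to/exclude_nodes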
+
+## Shuffle Server
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.server.buffer.capacity|-|Max memory of buffer manager for shuffle server|
+|rss.server.read.buffer.capacity|10000|Max size of buffer for reading data|
+|rss.server.heartbeat.delay|10000|rss heartbeat initial delay (ms)|
+|rss.server.heartbeat.threadNum|2|rss heartbeat thread number|
+|rss.server.heartbeat.interval|10000|Heartbeat interval to Coordinator (ms)|
+|rss.server.heartbeat.timeout|10000|Heartbeat timeout to Coordinator (ms)|
+|rss.server.flush.threadPool.size|10|Thread pool for flushing data to file|
+|rss.server.flush.threadPool.queue.size|Integer.MAX_VALUE|Size of the waiting queue for the thread pool|
+|rss.server.flush.thread.alive|120|Thread idle time in pool (s)|
+|rss.server.commit.timeout|600000|Timeout when committing shuffle data (ms)|
+|rss.server.commit.check.interval.max.ms|10000|Max interval (ms) for checking commit status|
+|rss.server.write.retry.max|10|Retry times when writes fail|
+|rss.server.app.expired.withoutHeartbeat|60000|Expired time (ms) for an application which has no heartbeat with the coordinator|
+|rss.server.memory.request.retry.max|50|Max times to retry a memory request|
+|rss.server.preAllocation.expired|20000|Expired time (ms) for a pre-allocated buffer|
+|rss.server.write.slow.threshold|10000|Threshold that defines a slow write|
+|rss.server.event.size.threshold.l1|200000|Threshold for event1 size|
+|rss.server.event.size.threshold.l2|200000|Threshold for event2 size|
+|rss.server.event.size.threshold.l3|200000|Threshold for event3 size|
+|rss.server.cleanup.threshold|10.0|Threshold for disk cleanup|
+|rss.server.pending.event.timeout.sec|600|If the disk cannot be written for this many seconds, the flush data event will fail|
+|rss.server.shuffle.expired.timeout.ms|120000|Shuffle data reserve timeout|
+|rss.server.index.size.hint|2097152|The index file size hint|
+|rss.server.health.max.storage.usage.percentage|90.0|If the usage percentage of a storage exceeds this value, the disk becomes unavailable|
+|rss.server.health.storage.recovery.usage.percentage|80.0|If the usage percentage of an unavailable storage declines below this value, the disk becomes available again|
+|rss.server.health.check.interval.ms|5000|The interval for the health check|
+|rss.server.health.min.storage.percentage|80|The minimum percentage of storage that must pass the check to mark the node as healthy|
+|rss.server.health.check.enable|false|The switch for the health check|
+|rss.server.health.checker.class.names|-|The list of the checkers' class names|
+|rss.server.memory.shuffle.highWaterMark.percentage|75.0|Threshold for spilling data to storage, as a percentage of rss.server.buffer.capacity|
+|rss.server.memory.shuffle.lowWaterMark.percentage|25.0|Threshold for keeping data in memory, as a percentage of rss.server.buffer.capacity|
+|rss.server.flush.cold.storage.threshold.size|64M|The data size threshold between LOCALFILE and HDFS if MEMORY_LOCALFILE_HDFS is used|
+|rss.server.tags|-|The comma-separated list of tags indicating the shuffle server's attributes. It will be used as the assignment basis for the coordinator|
+|rss.server.single.buffer.flush.enabled|false|Whether to flush a single buffer when its size exceeds rss.server.single.buffer.flush.threshold|
+|rss.server.single.buffer.flush.threshold|64M|The flush threshold of a single shuffle buffer|
+|rss.server.disk.capacity|-1|Disk capacity that the shuffle server can use. If negative, the whole space of the default disk will be used|
+|rss.server.multistorage.fallback.max.fail.times|0|For multistorage, if fail times exceed this number, storage will be switched|
+|rss.server.localstorage.initialize.max.fail.number|0|For localstorage, the server will exit when the number of failed local storage initializations exceeds this number|
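+
+A sketch of the storage-related part of a server.conf (capacities are placeholders to adapt to your hardware):
+
+    rss.server.buffer.capacity 40g
+    rss.server.read.buffer.capacity 20g
+    rss.storage.type MEMORY_LOCALFILE_HDFS
+    rss.server.flush.cold.storage.threshold.size 64M
+    rss.server.disk.capacity 1t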
+
+
+## Client
+
+### Common Settings
+
+>Common settings are used by both the spark and mr clients, but each carries a different prefix:
+>`spark.` is used by the spark client and `mapreduce.` is used by the mr client.
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.client.type|GRPC|rpc client type|
+|rss.client.retry.max|100|max rpc retry count|
+|rss.client.retry.interval.max|10000|rpc retry interval (ms)|
+|rss.coordinator.quorum|-|coordinator quorum to access|
+|rss.data.replica|1|data replica num|
+|rss.data.replica.write|1|quorum write num|
+|rss.data.replica.read|1|quorum read num|
+|rss.data.replica.skip.enabled|true|enable data replica skipping|
+|rss.client.data.transfer.pool.size|the host's processor num|data transfer threadpool size|
+|rss.client.data.commit.pool.size|-1|data commit pool size|
+|rss.heartbeat.interval|10000|client heartbeat interval|
+|rss.heartbeat.timeout|-|client heartbeat timeout|
+|rss.storage.type|-|storage type of shuffle data, supports MEMORY_LOCALFILE, MEMORY_HDFS, MEMORY_LOCALFILE_HDFS|
+|rss.client.send.check.interval.ms|500|check send result interval (ms)|
+|rss.client.send.check.timeout.ms|600000|send result check timeout (ms)|
+|rss.writer.buffer.size|3m|Buffer size for single partition data|
+|rss.remote.storage.path|-|remote storage path|
+|rss.index.read.limit|500|index num limit for each read|
+|rss.client.read.buffer.size|14m|buffer size for each read|
+|rss.client.assignment.tags|-|client assignment tags|
+|rss.client.assignment.retry.interval|-|client assignment retry interval (ms)|
+|rss.access.timeout.ms|10000|cluster access timeout|
+|rss.dynamicClientConf.enabled|true|use dynamic conf from coordinator|
+|rss.client.assignment.shuffle.nodes.max|-1|max assigned shuffle server nodes|
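+
+Because these keys take the client's prefix, the same setting is spelled differently per client. A sketch (the quorum address is a placeholder):
+
+    # Spark client, e.g. in spark-defaults.conf
+    spark.rss.coordinator.quorum <coordinator-host>:19999
+    spark.rss.storage.type MEMORY_LOCALFILE_HDFS
+
+    # MapReduce client, shown here as plain key-value for brevity
+    mapreduce.rss.coordinator.quorum <coordinator-host>:19999
+    mapreduce.rss.storage.type MEMORY_LOCALFILE_HDFS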
+
+### Spark Specialized Setting
+
+|Property Name|Default|Description|
+|---|---|---|
+|spark.rss.writer.serializer.buffer.size|3k|serializer buffer size|
+|spark.rss.writer.buffer.segment.size|3k|buffer size pre-allocated for each record added|
+|spark.rss.writer.buffer.spill.size|128m|Buffer size for total partition data|
+|spark.rss.writer.pre.allocated.buffer.size|16m|allocated size for each memory request from the spark executor|
+|spark.rss.writer.require.memory.retryMax|1200|max memory request times|
+|spark.rss.writer.require.memory.interval|1000|memory request interval (ms)|
+|spark.rss.client.heartBeat.threadNum|4|spark client heartbeat thread num|
+|spark.rss.client.send.size.limit|16m|The max data size sent to the shuffle server|
+|spark.rss.client.unregister.thread.pool.size|10|The max size of the thread pool for unregistering|
+|spark.rss.client.unregister.request.timeout.sec|10|The max timeout (sec) when unregistering from remote shuffle-servers|
+|spark.rss.client.send.threadPool.size|10|client send data threadpool size of each executor|
+|spark.rss.client.send.threadPool.keepalive|60|maximum time that excess idle threads will wait for new tasks before terminating|
+|spark.rss.client.bitmap.splitNum|1|bitmap split num|
+|spark.rss.access.id|-|auto access id|
+|spark.rss.enabled|false|enable RssShuffleManager; this is used in auto access mode (DelegationRssShuffleManager)|
+|spark.rss.client.access.retry.interval.ms|20000|auto access retry interval|
+|spark.rss.client.access.retry.times|0|auto access retry count|
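+
+For instance, a spark-submit sketch that enables Uniffle and tunes a couple of the writer settings above. The shuffle manager class follows the RssShuffleManager naming used above, but treat the exact package and the jar name as assumptions to verify against your deployment:
+
+    spark-submit \
+      --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager \
+      --conf spark.rss.coordinator.quorum=<coordinator-host>:19999 \
+      --conf spark.rss.writer.buffer.spill.size=256m \
+      --conf spark.rss.client.send.size.limit=16m \
+      your-application.jar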
+
+### MapReduce Specialized Setting
+
+|Property Name|Default|Description|
+|---|---|---|
+|mapreduce.rss.client.send.threshold|0.2|threshold to send data|
+|mapreduce.rss.client.sort.memory.use.threshold|-|threshold of memory for sorting in the client|
+|mapreduce.rss.client.max.buffer.size|3k|The max buffer size on the map side|
+|mapreduce.rss.client.batch.trigger.num|50|The max batch of buffers to send data on the map side|
+|mapreduce.rss.client.memory.threshold|0.8|Threshold to send data in the mapper|
+|mapreduce.rss.reduce.remote.spill.enable|false|Whether to use remote spill (Experimental, MEMORY_LOCAL_HDFS mode)|
+|mapreduce.rss.reduce.remote.spill.attempt.inc|1|Increase of reduce attempts, since HDFS is more failure-prone than local disk (Experimental, MEMORY_LOCAL_HDFS mode)|
+|mapreduce.rss.reduce.remote.spill.replication|1|The replication number for spilling data to HDFS (Experimental, MEMORY_LOCAL_HDFS mode)|
+|mapreduce.rss.reduce.remote.spill.retries|5|The retry number for spilling data to HDFS (Experimental, MEMORY_LOCAL_HDFS mode)|
+|mapreduce.rss.remote.storage.conf|-|remote storage cluster conf|
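+
+A sketch of passing these settings to a MapReduce job on the command line, assuming the job's main class uses ToolRunner so -D options are parsed (the jar and quorum address are placeholders):
+
+    hadoop jar your-application.jar \
+      -Dmapreduce.rss.coordinator.quorum=<coordinator-host>:19999 \
+      -Dmapreduce.rss.client.max.buffer.size=3k \
+      -Dmapreduce.rss.client.batch.trigger.num=50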
+
+## Security
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.security.hadoop.kerberos.enable|false|Whether to enable access to a secured hadoop cluster|
+|rss.security.hadoop.kerberos.krb5-conf.file|-|The file path of krb5.conf. This option is only valid when rss.security.hadoop.kerberos.enable is enabled|
+|rss.security.hadoop.kerberos.keytab.file|-|The kerberos keytab file path. This option is only valid when rss.security.hadoop.kerberos.enable is enabled|
+|rss.security.hadoop.kerberos.principal|-|The kerberos keytab principal. This option is only valid when rss.security.hadoop.kerberos.enable is enabled|
+|rss.security.hadoop.kerberos.relogin.interval.sec|60|The kerberos authentication relogin interval (sec)|
+|rss.jetty.ssl.enable|false|enable metrics service jetty ssl|
+|rss.jetty.https.port|-|metrics service jetty ssl port|
+|rss.jetty.ssl.keystore.path|-|jetty ssl keystore path|
+|rss.jetty.ssl.keymanager.password|-|jetty ssl keymanager password|
+|rss.jetty.ssl.keystore.password|-|jetty ssl keystore password|
+|rss.jetty.ssl.truststore.password|-|jetty ssl truststore password|
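+
+A sketch of enabling Kerberos access to a secured HDFS (paths, principal, and realm are placeholders):
+
+    rss.security.hadoop.kerberos.enable true
+    rss.security.hadoop.kerberos.krb5-conf.file /etc/krb5.conf
+    rss.security.hadoop.kerberos.keytab.file /path/to/rss.keytab
+    rss.security.hadoop.kerberos.principal rss/[email protected]
+    rss.security.hadoop.kerberos.relogin.interval.sec 60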
diff --git a/docs/07-build.md b/docs/07-build.md
new file mode 100644
index 0000000..8c3d860
--- /dev/null
+++ b/docs/07-build.md
@@ -0,0 +1,43 @@
+# Build Notes
+
+## Build requirements
+
+* Mac OS X or Linux
+* JDK 1.8, 64-bit
+* [Maven](https://maven.apache.org/)
+  
+## Building Uniffle
+ 
+### Build Minimal
+
+    mvn -DskipTests clean package
+
+### Build against profile Spark2 (2.4.6)
+
+    mvn -DskipTests clean package -Pspark2
+
+### Build against profile Spark3 (3.1.2)
+
+    mvn -DskipTests clean package -Pspark3
+
+### Build against Spark 3.2.x, except 3.2.0
+
+    mvn -DskipTests clean package -Pspark3.2
+
+### Build against Spark 3.2.0
+
+    mvn -DskipTests clean package -Pspark3.2.0
+
+### Build distribution
+
+    ./build_distribution.sh
+
+### Build distribution for Spark 3.2.0
+
+    ./build_distribution.sh --spark3-profile 'spark3.2.0'
+
+### Build distribution for Spark 3.2.x (not 3.2.0)
+
+    ./build_distribution.sh --spark3-profile 'spark3.2'
+
+`rss-xxx.tgz` will be generated.
\ No newline at end of file
diff --git a/docs/08-tuning.md b/docs/08-tuning.md
new file mode 100644
index 0000000..f6fd9d9
--- /dev/null
+++ b/docs/08-tuning.md
@@ -0,0 +1,21 @@
+# Tuning Uniffle
+
+## Local Disk
+
+Normally HDD is recommended since the shuffle data is non-critical and temporary. You can skip this part if you do not care about the cost of SSD. Uniffle's shuffle server is a typical I/O-intensive application, so it's suggested to use multiple disks to amortize the I/O and increase the total throughput. [RAID10](https://www.techtarget.com/searchstorage/definition/RAID-10-redundant-array-of-independent-disks) is strongly recommended to get lower latency, high throughput, and good fault tolerance.
+
+## Flush Threshold
+
+One of the design principles of Uniffle is to handle both small and huge shuffles, so the shuffle server writes small blocks to the local disk and big blocks to HDFS; the threshold is configured by the parameter `rss.server.flush.cold.storage.threshold.size`. It is recommended to tune this parameter to adjust the ratio between the total remote write size and the total local write size according to your workload and your storage system's situation.
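+
+For example, to keep blocks under 128 MB on local disk and send bigger ones to HDFS, one could set (the value is illustrative):
+
+    rss.server.flush.cold.storage.threshold.size 128m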
+
+## Total buffer size
+
+Uniffle uses `rss.server.buffer.capacity` to constrain the memory consumed by the buffered shuffle data; however, the memory used by the metadata is not controlled, so it is recommended to reserve a certain percentage of headroom, for example 20%, in the shuffle server's JVM heap (`XMX_SIZE`) in addition to the total buffer (`rss.server.buffer.capacity` + `rss.server.read.buffer.capacity`).
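+
+A rough worked example under the 20% headroom suggested above (the capacities are illustrative):
+
+    # total buffer = 40g + 10g = 50g
+    # XMX_SIZE should then be about 50g * 1.2 = 60g
+    rss.server.buffer.capacity 40g
+    rss.server.read.buffer.capacity 10g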
+
+## Partition num and spark write buffer size
+
+Uniffle's spark client allocates a buffer for each partition, and if the total memory used exceeds the total buffer spill threshold (`spark.rss.writer.buffer.spill.size`), the client sends all the buffered data to the shuffle server; it is therefore recommended to increase this threshold for huge shuffle spark applications to avoid small network I/O.
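+
+For a huge shuffle, a sketch of raising the spill threshold in the spark conf (the value is illustrative):
+
+    spark.rss.writer.buffer.spill.size 512m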
+
+## Multiple Remote Storage
+
+Uniffle supports not only multiple local disks but also multiple remote storages with different configurations.
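+
+For instance, per-cluster settings can be supplied through rss.coordinator.remote.storage.cluster.conf in the $clusterId,$key=$value format described in the configuration doc (the cluster ids and keys below are placeholders):
+
+    rss.coordinator.remote.storage.cluster.conf hdfsCluster1,dfs.replication=2;hdfsCluster2,dfs.replication=3
+    rss.coordinator.remote.storage.select.strategy APP_BALANCE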
