This is an automated email from the ASF dual-hosted git repository.
nicholasjiang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/celeborn-website.git
The following commit(s) were added to refs/heads/main by this push:
new 5ac45be6 Add release note for 0.6.0 (#110)
5ac45be6 is described below
commit 5ac45be66dab504791fca2710ab2d6cd1149df44
Author: Fei Wang <[email protected]>
AuthorDate: Sun Jul 6 23:16:56 2025 -0700
Add release note for 0.6.0 (#110)
* Release 0.6.0
* Update
* Support Flink 2.0
* author mapping
---
.github/workflows/site.yaml | 1 +
docs/community/news.md | 1 +
docs/community/release_notes/release_note_0.6.0.md | 484 +++++++++++++++++++++
docs/download.md | 12 +
mkdocs.yml | 4 +-
5 files changed, 500 insertions(+), 2 deletions(-)
diff --git a/.github/workflows/site.yaml b/.github/workflows/site.yaml
index 81f42f54..d4903156 100644
--- a/.github/workflows/site.yaml
+++ b/.github/workflows/site.yaml
@@ -62,6 +62,7 @@ jobs:
- run: ./.github/bin/build_docs.sh 'tags/v0.5.2.tar.gz' '0.5.2'
- run: ./.github/bin/build_docs.sh 'tags/v0.5.3.tar.gz' '0.5.3'
- run: ./.github/bin/build_docs.sh 'tags/v0.5.4.tar.gz' '0.5.4'
+ - run: ./.github/bin/build_docs.sh 'tags/v0.6.0.tar.gz' '0.6.0'
- run: |
echo 'publish:' >> .asf.yaml
echo ' whoami: asf-site' >> .asf.yaml
diff --git a/docs/community/news.md b/docs/community/news.md
index 6004e02d..715dde51 100644
--- a/docs/community/news.md
+++ b/docs/community/news.md
@@ -20,6 +20,7 @@ license: |
| Date | Title
| Brief
|
|-------------------|---------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
+| 2025 July 5 | Release 0.6.0
| Celeborn release 0.6.0.
|
| 2025 May 20 | New Committer: Jiaming Xie
| We are happy to announce Jiaming Xie becomes a new Celeborn committer.
|
| 2025 March 13 | Release 0.5.4
| Celeborn release 0.5.4.
|
| 2025 January 6 | Release 0.5.3
| Celeborn release 0.5.3.
|
diff --git a/docs/community/release_notes/release_note_0.6.0.md
b/docs/community/release_notes/release_note_0.6.0.md
new file mode 100644
index 00000000..02a10aa1
--- /dev/null
+++ b/docs/community/release_notes/release_note_0.6.0.md
@@ -0,0 +1,484 @@
+---
+hide:
+ - navigation
+
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ https://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+# Apache Celeborn™ 0.6.0 Release Notes
+
+## Highlight
+
+- Support Flink hybrid shuffle integration with Apache Celeborn
+- Support Spark 4.0.0
+- Support Flink 2.0
+- Support HARD_SPLIT in PushMergedData
+- Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files
+- Refine the celeborn RESTful APIs and introduce Celeborn CLI for automation
+- Supporting worker Tags
+- CongestionController supports control traffic by user/worker traffic speed
+- Support read shuffle from S3
+- Support CppClient in Celeborn
+- HELM Chart Optimization
+
+### Improvement
+
+- [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid
possible worker OOM
+- [CELEBORN-2046] Specify extractionDir of AsyncProfilerLoader with
celeborn.worker.jvmProfiler.localDir
+- [CELEBORN-2027] Allow CelebornShuffleReader to decompress data on demand
+- [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload
+- [CELEBORN-2018] Support min number of workers selected for shuffle
+- [CELEBORN-2006] LifecycleManager should avoid parsing shufflePartitionType
every time
+- [CELEBORN-2007] Reduce PartitionLocation memory usage
+- [CELEBORN-2008] SlotsAllocator should select disks randomly in RoundRobin
mode
+- [CELEBORN-1896] delete data from failed to fetch shuffles
+- [CELEBORN-2004] Filter empty partition before createIntputStream
+- [CELEBORN-2002][MASTER] Audit shuffle lifecycle in separate log file
+- [CELEBORN-1993] CelebornConf introduces celeborn.<module>.io.threads to
specify number of threads used in the client thread pool
+- [CELEBORN-1995] Optimize memory usage for push failed batches
+- [CELEBORN-1994] Introduce disruptor dependency to support asynchronous
logging of log4j2
+- [CELEBORN-1965] Rely on all default hadoop providers for S3 auth
+- [CELEBORN-1982] Slot Selection Perf Improvements
+- [CELEBORN-1961] Convert Resource.proto from Protocol Buffers version 2 to
version 3
+- [CELEBORN-1931] use gather API for local flusher to optimize write io pattern
+- [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
+- [CELEBORN-1925] Support Flink 2.0
+- [CELEBORN-1844][CIP-8] introduce tier writer proxy and simplify partition
data writer
+- [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark
driver network exhausted
+- [CELEBORN-1928][CIP-12] Support HARD_SPLIT in PushMergedData should support
handle older worker success response
+- [CELEBORN-1930][CIP-12] Support HARD_SPLIT in PushMergedData should handle
congestion control NPE issue
+- [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle
+- [CELEBORN-1856] Support stage-rerun when read partition by chunkOffsets when
enable optimize skew partition read
+- [CELEBORN-1894] Allow skipping already read chunks during unreplicated
shuffle read retried
+- [CELEBORN-1909] Support pre-run static code blocks of TransportMessages to
improve performance of protobuf serialization
+- [CELEBORN-1911] Move multipart-uploader to
multipart-uploader/multipart-uploader-s3 for extensibility
+- [CELEBORN-1910] Remove redundant synchronized of isTerminated in
ThreadUtils#sameThreadExecutorService
+- [CELEBORN-1898] SparkOutOfMemoryError compatible with Spark 4.0 and 4.1
+- [CELEBORN-1897] Avoid calling toString for too long messages
+- [CELEBORN-1882] Support configuring the SSL handshake timeout for SSLHandler
+- [CELEBORN-1883] Replace HashSet with ConcurrentHashMap.newKeySet for
ShuffleFileGroups
+- [CELEBORN-1858] Support DfsPartitionReader read partition by chunkOffsets
when enable optimize skew partition read
+- [CELEBORN-1879] Ignore invalid chunk range generated by
splitSkewedPartitionLocations
+- [CELEBORN-1857] Support LocalPartitionReader read partition by chunkOffsets
when enable optimize skew partition read
+- [CELEBORN-1876] Log remote address on RPC exception for
TransportRequestHandler
+- [CELEBORN-1865] Update master endpointRef when master leader is abnormal
+- [CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid
sorting shuffle files
+- [CELEBORN-1861] Support celeborn.worker.storage.baseDir.diskType option to
specify disk type of base directory for worker
+- [CELEBORN-1757] Add retry when sending RPC to LifecycleManager
+- [CELEBORN-1859] DfsPartitionReader and LocalPartitionReader should reuse
pbStreamHandlers get from BatchOpenStream request
+- [CELEBORN-1843] Optimize roundrobin for more balanced disk slot allocation
+- [CELEBORN-1854] Change receive revive request log level to debug
+- [CELEBORN-1841] Support custom implementation of EventExecutorChooser to
avoid deadlock when calling await in EventLoop thread
+- [CELEBORN-1847][CIP-8] Introduce local and DFS tier writer
+- [CELEBORN-1850] Setup worker endpoint after initalizing controller
+- [CELEBORN-1838] Interrupt spark task should not report fetch failure
+- [CELEBORN-1835][CIP-8] Add tier writer base and memory tier writer
+- [CELEBORN-1829] Replace waitThreadPoll's thread pool with
ScheduledExecutorService in Controller
+- [CELEBORN-1720] Prevent stage re-run if another task attempt is running or
successful
+- [CELEBORN-1482][CIP-8] Add partition meta handler
+- [CELEBORN-1812] Distinguish sorting-file from sort-tasks waiting to be
submitted
+- [CELEBORN-1737] Support build tez client package
+- [CELEBORN-1802] Fail the celeborn master/worker start if CELEBORN_CONF_DIR
is not directory
+- [CELEBORN-1701][FOLLOWUP] Support stage rerun for shuffle data lost
+- [CELEBORN-1413] Support Spark 4.0
+- [CELEBORN-1753] Optimize the code for `exists` and `find` method
+- [CELEBORN-1777] Add `java.security.jgss/sun.security.krb5` to
DEFAULT_MODULE_OPTIONS
+- [CELEBORN-1731] Support merged kv input for Tez
+- [CELEBORN-1732] Support unordered kv input for Tez
+- [CELEBORN-1733] Support ordered grouped kv input for Tez
+- [CELEBORN-1721][CIP-12] Support HARD_SPLIT in PushMergedData
+- [CELEBORN-1758] Remove the empty user resource consumption from worker
heartbeat
+- [CELEBORN-1729] Support ordered KV output for Tez
+- [CELEBORN-1730] Support unordered KV output for Tez
+- [CELEBORN-1612] Add a basic reader writer class to Tez
+- [CELEBORN-1725] Optimize performance of handling `MapperEnd` RPC in
`LifecycleManager`
+- [CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle
implementation
+- [CELEBORN-1621][CIP-11] Predefined worker tags expr via dynamic configs
+- [CELEBORN-1545] Add Tez plugin skeleton and dag app master
+- [CELEBORN-1530] support MPU for S3
+- [CELEBORN-1618][CIP-11] Supporting tags via DB Config Service
+- [CELEBORN-1726] Update WorkerInfo when transition worker state
+- [CELEBORN-1715] RemoteShuffleMaster should check
celeborn.client.push.replicate.enabled in constructor
+- [CELEBORN-1714] Optimize handleApplicationLost
+- [CELEBORN-1660] Cache available workers and only count the available workers
device free capacity
+- [CELEBORN-1619][CIP-11] Integrate tags manager with config service
+- [CELEBORN-1071] Support stage rerun for shuffle data lost
+- [CELEBORN-1697] Improve ThreadStackTrace for thread dump
+- [CELEBORN-1671] CelebornShuffleReader will try replica if create client
failed
+- [CELEBORN-1682] Add java tools.jar into classpath for JVM quake
+- [CELEBORN-1660] Using map for workers to find worker fast
+- [CELEBORN-1673] Support retry create client
+- [CELEBORN-1577][PHASE1] Storage quota should support interrupt shuffle
+- [CELEBORN-1642][CIP-11] Support multiple worker tags
+- [CELEBORN-1636] Client supports dynamic update of Worker resources on the
server
+- [CELEBORN-1651] Support ratio threshold of unhealthy disks for excluding
worker
+- [CELEBORN-1601] Support revise lost shuffles
+- [CELEBORN-1487][PHASE2] CongestionController support dynamic config
+- [CELEBORN-1648] Refine AppUniqueId with UUID suffix
+- [CELEBORN-1490][CIP-6] Introduce tier consumer for hybrid shuffle
+- [CELEBORN-1645] Introduce ShuffleFallbackPolicy to support custom
implementation of shuffle fallback policy for
CelebornShuffleFallbackPolicyRunner
+- [CELEBORN-1620][CIP-11] Support passing worker tags via RequestSlots message
+- [CELEBORN-1487][PHASE1] CongestionController support control traffic by
user/worker traffic speed
+- [CELEBORN-1637] Enhance config to bypass memory check for partition file
sorter
+- [CELEBORN-1638] Improve the slots allocator performance
+- [CELEBORN-1617][CIP-11] Support workers tags in FS Config Service
+- [CELEBORN-1615] Start the http server after all handlers added
+- [CELEBORN-1574] Speed up unregister shuffle by batch processing
+- [CELEBORN-1625] Add parameter `skipCompress` for pushOrMergeData
+- [CELEBORN-1597][CIP-11] Implement TagsManager
+- [CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle
+- [CELEBORN-1602] do hard split for push merged data RPC with disk full
+- [CELEBORN-1594] Refine dynamicConfig template and prevent NPE
+- [CELEBORN-1513] Support wildcard bind in dual stack environments
+- [CELEBORN-1587] Change to debug logging on client side for SortBasedPusher
trigger push
+- [CELEBORN-1496] Differentiate map results with only different stageAttemptId
+- [CELEBORN-1583] MasterClient#sendMessageInner should throw Throwable for
celeborn.masterClient.maxRetries is 0
+- [CELEBORN-1578] Make Worker#timer have thread name and daemon
+- [CELEBORN-1580] ReadBufferDispacther should notify exception to listener
+- [CELEBORN-1575] TimeSlidingHub should remove expire node when reading
+- [CELEBORN-1560] Remove usages of deprecated Files.createTempDir of Guava
+- [CELEBORN-1567] Support throw FetchFailedException when Data corruption
detected
+- [CELEBORN-1568] Support worker retries in MiniCluster
+- [CELEBORN-1563] Log networkLocation in WorkerInfo
+- [CELEBORN-1531] Refactor self checks in master
+- [CELEBORN-1550] Add support of providing custom dynamic store backend
implementation
+- [CELEBORN-1529] Read shuffle data from S3
+- [CELEBORN-1518] Add support for Apache Spark barrier stages
+- [CELEBORN-1542] Master supports to check the worker host pattern on worker
registration
+- [CELEBORN-1535] Support to disable master workerUnavailableInfo expiration
+- [CELEBORN-1511] Add support for custom master endpoint resolver
+- [CELEBORN-1541] Enhance the readable address for internal port
+- [CELEBORN-1533] Log location when CelebornInputStream#fillBuffer fails
+- [CELEBORN-1524] Support IPv6 hostnames for Apache Ratis
+- [CELEBORN-1519] Do not update estimated partition size if it is unchanged
+- [CELEBORN-1521] Introduce celeborn-spi module for authentication extensions
+- [CELEBORN-1469] Support writing shuffle data to OSS(S3 only)
+- [CELEBORN-1446] Enable chunk prefetch when initialize CelebornInputStream
+- [CELEBORN-1509] Reply response without holding a lock
+- [CELEBORN-1543] Support Flink 1.20
+- [CELEBORN-1504] Support for Apache Flink 1.16
+- [CELEBORN-1500] Filter out empty InputStreams
+- [CELEBORN-1495] CelebornColumnDictionary supports dictionary of float and
double column type
+- [CELEBORN-1494] Support IPv6 addresses in
PbSerDeUtils.fromPackedPartitionLocations
+- [CELEBORN-1483] Add storage policy
+- [CELEBORN-1479] Report register shuffle failed reason in exception
+- [CELEBORN-1489] Update Flink support with authentication support
+- [CELEBORN-1485] Refactor addCounter, addGauge and addTimer of AbstractSource
to reduce CPU utilization
+- [CELEBORN-1472] Reduce CongestionController#userBufferStatuses call times
+- [CELEBORN-1471] CelebornScalaObjectMapper supports configuring
FAIL_ON_UNKNOWN_PROPERTIES to false
+
+### Flink hybrid shuffle
+- [CELEBORN-1867][FLINK] Fix flink client memory leak of
TransportResponseHandler#outstandingRpcs for handling addCredit and
notifyRequiredSegment response
+- [CELEBORN-1866][FLINK] Fix CelebornChannelBufferReader request more buffers
than needed
+- [CELEBORN-1490][CIP-6] Support process large buffer in flink hybrid shuffle
+- [CELEBORN-1490][CIP-6] Add Flink hybrid shuffle doc
+- [CELEBORN-1490][CIP-6] Impl worker read process in Flink Hybrid Shuffle
+- [CELEBORN-1490][CIP-6] Introduce tier producer in celeborn flink client
+- [CELEBORN-1490][CIP-6] Introduce tier factory and master agent in flink
hybrid shuffle
+- [CELEBORN-1490][CIP-6] Enrich register shuffle method
+- [CELEBORN-1490][CIP-6] Extends FileMeta to support hybrid shuffle
+- [CELEBORN-1490][CIP-6] Extends message to support hybrid shuffle
+
+### RESTful API and CLI
+- [CELEBORN-1056][FOLLOWUP] Support upsert and delete of dynamic configuration
management
+- [CELEBORN-2020] Support http authentication for Celeborn CLI
+- [CELEBORN-1875] Support to get workers topology information with RESTful api
+- [CELEBORN-1797] Support to adjust the logger level with RESTful API during
runtime
+- [CELEBORN-1750] Return struct worker resource consumption information with
RESTful api
+- [CELEBORN-1707] Audit all RESTful api calls and use separate restAuditFile
+- [CELEBORN-1632] Support to apply ratis local raft_meta_conf command with
RESTful api
+- [CELEBORN-1599] Container Info REST API
+- [CELEBORN-1630] Support to apply ratis peer operation with RESTful api
+- [CELEBORN-1631] Support to apply ratis snapshot operation with RESTful api
+- [CELEBORN-1633] Return more raft group information
+- [CELEBORN-1629] Support to apply ratis election operation with RESTful api
+- [CELEBORN-1609] Support SSL for celeborn RESTful service
+- [CELEBORN-1608] Reduce redundant response for RESTful api
+- [CELEBORN-1607] Enable `useEnumCaseInsensitive` for openapi-generator
+- [CELEBORN-1589] Ensure master is leader for some POST request APIs
+- [CELEBORN-1572] Celeborn CLI initial REST API support
+- [CELEBORN-1553] Using the request base url as swagger server
+- [CELEBORN-1537] Support to remove workers unavailable info with RESTful api
+- [CELEBORN-1546] Support authorization on swagger UI
+- [CELEBORN-1477] Using openapi-generator `apache-httpclient` library instead
of `jersey2`
+- [CELEBORN-1493] Check admin privileges for http mutative requests
+- [CELEBORN-1477][CIP-9] Refine the celeborn RESTful APIs
+- [CELEBORN-1476] Enhance the RESTful response error msg
+- [CELEBORN-1475] Fix unknownExcludedWorkers filter for /exclude request
+- [CELEBORN-1318] Support celeborn http authentication
+
+### Monitoring
+- [CELEBORN-2024] Publish commit files fail count metrics
+- [CELEBORN-1892] Adding register with master fail count metric for worker
+- [CELEBORN-2005] Introduce numBytesIn, numBytesOut, numBytesInPerSecond,
numBytesOutPerSecond metrics for RemoteShuffleServiceFactory
+- [CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount
metric to record the total and fallback count of application
+- [CELEBORN-1974] ApplicationId as metrics label should be behind a config flag
+- [CELEBORN-1968] Publish metric for unreleased partition location count when
worker was gracefully shutdown
+- [CELEBORN-1977] Add help/type on prometheus exposed metrics
+- [CELEBORN-1831] Add ratis commitIndex metrics
+- [CELEBORN-1817] add committed file size metrics
+- [CELEBORN-1791] All NettyMemoryMetrics should register to source
+- [CELEBORN-1804] Shuffle environment metrics of RemoteShuffleEnvironment
should use Shuffle.Remote metric group
+- [CELEBORN-1634][FOLLOWUP] Add rpc metrics into grafana dashboard
+- [CELEBORN-1766] Add detail metrics about fetch chunk
+- [CELEBORN-1756] Only gauge hdfs metrics if HDFS storage enabled to reduce
metrics
+- [CELEBORN-1634] Support queue time/processing time metrics for rpc framework
+- [CELEBORN-1706] Use bytes(IEC) unit instead of bytes(SI) for size related
metrics in prometheus dashboard
+- [CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric
+- [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
+- [CELEBORN-1444][FOLLOWUP] Add IsDecommissioningWorker to celeborn dashboard
+- [CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas,
tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and
chunkSize
+- [CELEBORN-1656] Remove duplicate UserProduceSpeed metrics
+- [CELEBORN-1627] Introduce `instance` variable for celeborn dashboard to
filter metrics
+- [CELEBORN-1582] Publish metric for unreleased shuffle count when worker was
decommissioned
+- [CELEBORN-1501] Introduce application dimension resource consumption metrics
of Worker
+- [CELEBORN-1586] Add available workers Metrics
+- [CELEBORN-1581] Fix incorrect metrics of DeviceCelebornFreeBytes and
DeviceCelebornTotalBytes
+- [CELEBORN-1491] introduce flusher working queue size metric
+
+### CPP Client
+- [CELEBORN-1978][CIP-14] Add code style checking for cppClient
+- [CELEBORN-1958][CIP-14] Add testsuite to test writing with javaClient and
reading with cppClient
+- [CELEBORN-1874][CIP-14] Add CICD procedure in github action for cppClient
+- [CELEBORN-1932][CIP-14] Adapt java's serialization to support cpp
serialization for GetReducerFileGroup/Response
+- [CELEBORN-1915][CIP-14] Add reader's ShuffleClient to cppClient
+- [CELEBORN-1906][CIP-14] Add CelebornInputStream to cppClient
+- [CELEBORN-1881][CIP-14] Add WorkerPartitionReader to cppClient
+- [CELEBORN-1871][CIP-14] Add NettyRpcEndpointRef to cppClient
+- [CELEBORN-1863][CIP-14] Add TransportClient to cppClient
+- [CELEBORN-1845][CIP-14] Add MessageDispatcher to cppClient
+- [CELEBORN-1836][CIP-14] Add Message to cppClient
+- [CELEBORN-1827][CIP-14] Add messageDecoder to cppClient
+- [CELEBORN-1821][CIP-14] Add controlMessages to cppClient
+- [CELEBORN-1819][CIP-14] Refactor cppClient with nested namespace
+- [CELEBORN-1814][CIP-14] Add transportMessage to cppClient
+- [CELEBORN-1809][CIP-14] Add partitionLocation to cppClient
+- [CELEBORN-1799][CIP-14] Add celebornConf to cppClient
+- [CELEBORN-1785][CIP-14] Add baseConf to cppClient
+- [CELEBORN-1772][CIP-14] Add memory module to cppClient
+- [CELEBORN-1761][CIP-14] Add cppProto to cppClient
+- [CELEBORN-1754][CIP-14] Add exceptions and checking utils to cppClient
+- [CELEBORN-1751][CIP-14] Add celebornException utils to cppClient
+- [CELEBORN-1740][CIP-14] Add stackTrace utils to cppClient
+- [CELEBORN-1741][CIP-14] Add processBase utils to cppClient
+- [CELEBORN-1724][CIP-14] Add environment setup tools for CppClient development
+
+### HELM Chart Optimization
+- [CELEBORN-2017][HELM] Add namespace to the metadata
+- [CELEBORN-1528][HELM] Use volume claim template to support various storage
backend
+- [CELEBORN-1996][HELM] Rename volumes.{master,worker} to
{master,worker}.volumes and {master.worker}.volumeMounts
+- [CELEBORN-1989][HELM] Split securityContext into master.podSecurityContext
and worker.podSecurityContext
+- [CELEBORN-1988][HELM] Split hostNetwork into master.hostNetwork and
worker.hostNetwork
+- [CELEBORN-1987][HELM] Split dnsPolicy into master.dnsPolicy and
worker.dnsPolicy
+- [CELEBORN-1981][HELM] Rename masterReplicas and workerReplicas to
master.replicas and worker.replicas
+- [CELEBORN-1985][HELM] Add new values master.envFrom and worker.envFrom
+- [CELEBORN-1986][HELM] Rename priorityClass.{master,worker} to
{master,worker}.priorityClass
+- [CELEBORN-1980][HELM] Split environments into master.env and worker.env
+- [CELEBORN-1972][HELM] Rename affinity.{master,worker} to
{master,worker}.affinity
+- [CELEBORN-1951][HELM] Rename resources.{master,worker} to
{master,worker}.resoruces
+- [CELEBORN-1962][HELM] Split tolerations into master.tolerations and
worker.tolerations
+- [CELEBORN-1955][HELM] Split nodeSelector into master.nodeSelector and
worker.nodeSelector
+- [CELEBORN-1953][HELM] Split podAnnotations into master.annotations and
worker.annotations
+- [CELEBORN-1954][HELM] Add a new value image.registry
+- [CELEBORN-1952][HELM] Define template helpers for master/worker respectively
+- [CELEBORN-1532][HELM] Read log4j2 and metrics configurations from file
+- [CELEBORN-1830] Chart statefulset resources key duplicate
+- [CELEBORN-1788] Add role and roleBinding helm charts
+- [CELEBORN-1786] add serviceAccount helm chart
+- [CELEBORN-1780] Add support for NodePort Service per Master replica
+- [CELEBORN-1552] automatically support prometheus to scrape metrics for helm
chart
+
+### Stability and Bug Fix
+- [CELEBORN-1721][FOLLOWUP] Return softsplit if there is no hardsplit for
pushMergeData
+- [CELEBORN-2043] Fix IndexOutOfBoundsException exception in
getEvictedFileWriter
+- [CELEBORN-2040] Avoid throw FetchFailedException when
GetReducerFileGroupResponse failed via broadcast
+- [CELEBORN-2042] Fix FetchFailure handling when TaskSetManager is not found
+- [CELEBORN-2033] updateProduceBytes should be called even if
updateProduceBytes throws exception
+- [CELEBORN-2025] RpcFailure Scala 2.13 serialization is incompatible
+- [CELEBORN-2022] Spark4 Client should package commons-io
+- [CELEBORN-2023] Spark4 Client incompatible with isLocalMaster method
+- [CELEBORN-2009] Commit files request failure should exclude worker in
LifecycleManager
+- [CELEBORN-2015] Retry IOException failures for RPC requests
+- [CELEBORN-1902] Read client throws PartitionConnectionException
+- [CELEBORN-2000] Ignore the getReducerFileGroup timeout before shuffle stage
end
+- [CELEBORN-1960] Fix PauseSpentTime only append the interval check time
+- [CELEBORN-1998] RemoteShuffleEnvironment should not register
InputChannelMetrics repeatedly
+- [CELEBORN-1999] OpenStreamTime should use requestId to record cost time
+- [CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage
when fetch fail has been reported
+- [CELEBORN-1948] Fix the issue where replica may lose data when HARD_SPLIT
occurs during handlePushMergeData
+- [CELEBORN-1912] Client should send heartbeat to worker for processing
heartbeat to avoid reading idleness of worker which enables heartbeat
+- [CELEBORN-1992] Ensure hadoop FS are not closed by hadoop ShutdownHookManager
+- [CELEBORN-1919] Hardsplit batch tracking should be disabled when pushing
only a single replica
+- [CELEBORN-1979] Change partition manager should respect the
`celeborn.storage.availableTypes`
+- [CELEBORN-1983] Fix fetch fail not throw due to reach spark maxTaskFailures
+- [CELEBORN-1976] CommitHandler should use
celeborn.client.rpc.commitFiles.askTimeout for timeout of doParallelCommitFiles
+- [CELEBORN-1966] Added fix to get userinfo in celeborn
+- [CELEBORN-1969] Remove celeborn.client.shuffle.mapPartition.split.enabled to
enable shuffle partition split at default for MapPartition
+- [CELEBORN-1970] Use StatusCode.fromValue instead of Utils.toStatusCode
+- [CELEBORN-1646][FOLLOWUP] DeviceMonitor should notifyObserversOnError with
CRITICAL_ERROR disk status for input/ouput error
+- [CELEBORN-1929] Avoid unnecessary buffer loss to get better buffer
reusability
+- [CELEBORN-1918] Add batchOpenStream time to fetch wait time
+- [CELEBORN-1947] Reduce log for CelebornShuffleReader sleeping before
inputStream ready
+- [CELEBORN-1923] Correct Celeborn available slots calculation logic
+- [CELEBORN-1914] incWriteTime when ShuffleWriter invoke pushGiantRecord
+- [CELEBORN-1822] Respond to RegisterShuffle with max epoch PartitionLocation
to avoid revive
+- [CELEBORN-1899] Fix configuration bug in shuffle s3
+- [CELEBORN-1885] Fix nullptr exceptions in FetchChunk after worker restart
+- [CELEBORN-1889] Fix scala 2.13 complie error
+- [CELEBORN-1846] Fix the StreamHandler usage in fetching chunk when task
attempt is odd
+- [CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead
of usedDirectMemory
+- [CELEBORN-1832] MapPartitionData should create fixed thread pool with
registration of ThreadPoolSource
+- [CELEBORN-1820] Failing to write and flush StreamChunk data should be
counted as FETCH_CHUNK_FAIL
+- [CELEBORN-1818] Fix incorrect timeout exception when waiting on no pending
writes
+- [CELEBORN-1763] Fix DataPusher be blocked for a long time
+- [CELEBORN-1783] Fix Pending task in commitThreadPool wont be canceled
+- [CELEBORN-1782] Worker in congestion control should be in blacklist to avoid
impact new shuffle
+- [CELEBORN-1510] Partial task unable to switch to the replica
+- [CELEBORN-1778] Fix commitInfo NPE and add assert in
LifecycleManagerCommitFilesSuite
+- [CELEBORN-1670] Avoid swallowing InterruptedException in ShuffleClientImpl
+ Revert "- [CELEBORN-1376] Push data failed should always release request
body"
+- [CELEBORN-1770] FlushNotifier should setException for all Throwables in
Flusher
+- [CELEBORN-1769] Fix packed partition location cause
GetReducerFileGroupResponse lose location
+- [CELEBORN-1767] Fix occasional errors in UT when creating workers
+- [CELEBORN-1765] Fix NPE when removeFileInfo in StorageManager
+- [CELEBORN-1760] OOM causes disk buffer unable to be released
+- [CELEBORN-1743] Resolve the metrics data interruption and the job failure
caused by locked resources
+- [CELEBORN-1759] Fix reserve slots might lost partition location between 0.4
client and 0.5 server
+- [CELEBORN-1749] Fix incorrect application diskBytesWritten metrics
+- [CELEBORN-1713] RpcTimeoutException should include RPC address in message
+- [CELEBORN-1728] Fix NPE when failing to connect to celeborn worker
+- [CELEBORN-1727] Correct the calculation of worker diskInfo actualUsableSpace
+- [CELEBORN-1718] Fix memory storage file won't hard split when memory file is
full and worker has no disks
+- [CELEBORN-1705] Fix disk buffer size is negative issue
+- [CELEBORN-1686] Avoid return the same pushTaskQueue
+- [CELEBORN-1696] StorageManager#cleanFile should remove file info
+- [CELEBORN-1691] Fix the issue that upstream tasks don't rerun and the
current task still retry when failed to decompress in flink
+- [CELEBORN-1693] Fix storageFetcherPool concurrent problem
+- [CELEBORN-1674] Fix reader thread name of MapPartitionData
+- [CELEBORN-1668] Fix NPE when handle closed file writers
+- [CELEBORN-1669] Fix NullPointerException for
PartitionFilesSorter#updateSortedShuffleFiles after cleaning up expired shuffle
key
+- [CELEBORN-1667] Fix NPE & LEAK occurring prior to worker registration
+- [CELEBORN-1664] Fix secret fetch failures after LEADER master failover
+- [CELEBORN-1665] CommitHandler should process CommitFilesResponse with
COMMIT_FILE_EXCEPTION status
+- [CELEBORN-1663] Only register appShuffleDeterminate if stage using celeborn
for shuffle
+- [CELEBORN-1661] Make sure that the sortedFilesDb is initialized successfully
when worker enable graceful shutdown
+- [CELEBORN-1662] Handle PUSH_DATA_FAIL_PARTITION_NOT_FOUND in
getPushDataFailCause
+- [CELEBORN-1655] Fix read buffer dispatcher thread terminate unexpectedly
+- [CELEBORN-1646] Catch exception of Files#getFileStore for DeviceMonitor and
StorageManager for input/ouput error
+- [CELEBORN-1652] Throw TransportableError for failure of sending
PbReadAddCredit to avoid flink task get stuck
+- [CELEBORN-1643] DataPusher handle InterruptedException
+- [CELEBORN-1579] Fix the memory leak of result partition
+- [CELEBORN-1564] Fix actualUsableSpace of offerSlotsLoadAware condition on
diskInfo
+- [CELEBORN-1573] Change to debug logging on client side for reserve slots
+- [CELEBORN-1557] Fix totalSpace of DiskInfo for Master in HA mode
+- [CELEBORN-1549] Fix networkLocation persistence into Ratis
+- [CELEBORN-1558] Fix the incorrect decrement of pendingWrites in
handlePushMergeData
+- [CELEBORN-1547] Worker#listTopDiskUseApps should return
celeborn.metrics.app.topDiskUsage.count applications
+- [CELEBORN-1544] ShuffleWriter needs to call close finally to avoid memory
leaks
+- [CELEBORN-1473] TransportClientFactory should register netty memory metric
with source for shared pooled ByteBuf allocator
+- [CELEBORN-1522] Fix applicationId extraction from shuffle key
+- [CELEBORN-1516] DynamicConfigServiceFactory should support singleton
+- [CELEBORN-1515] SparkShuffleManager should set lifecycleManager to null
after stopping lifecycleManager in Spark 2
+- [CELEBORN-1439] Fix revive logic bug which will casue data correctness issue
and job failiure
+- [CELEBORN-1507] Prevent invalid Filegroups from being used
+- [CELEBORN-1478] Fix wrong use partitionId as shuffleId when readPartition
+
+### Build
+- [CELEBORN-2012] Add license for http5
+- [CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
+- [CELEBORN-1746] Reduce the size of aws dependencies
+- [CELEBORN-1677][BUILD] Update SCM information for SBT build configuration
+- [CELEBORN-1649] Bumping up maven to 3.9.9
+- [CELEBORN-1658] Add Git Commit Info and Build JDK Spec to sbt Manifest
+- [CELEBORN-1659] Fix sbt make-distribution for cli
+- [CELEBORN-1616] Shade com.google.thirdparty to prevent dependency conflicts
+- [CELEBORN-1606] Generate dependencies-client-flink-1.16
+- [CELEBORN-1600] Enable check server dependencies
+- [CELEBORN-1556] Update Github actions to v4
+- [CELEBORN-1559] Fix make distribution script failed to recognize specific
profile
+- [CELEBORN-1526] Fix MR plugin can not run on Hadoop 3.1.0
+- [CELEBORN-1527] Error prone plugin should exclude
target/generated-sources/java of module
+- [CELEBORN-1505] Algin the celeborn server jackson dependency versions
+
+### Documentation
+- [CELEBORN-1870] Fix typos in in 'Developer' documents
+- [CELEBORN-1860] Remove unused celeborn.<module>.io.enableVerboseMetrics
option
+- [CELEBORN-1823] Remove unused remote-shuffle.job.min.memory-per-partition
and remote-shuffle.job.min.memory-per-gate
+- [CELEBORN-1811] Update default value for
`celeborn.master.slot.assign.extraSlots`
+- [CELEBORN-1789][DOC] Document on Java Columnar Shuffle
+- [CELEBORN-1774] Update default value of celeborn.<module>.io.mode to whether
epoll mode is available
+- [CELEBORN-1622][CIP-11] Adding documentation for Worker Tags feature
+- [CELEBORN-1755] Update doc to include S3 as one of storage layers
+- [CELEBORN-1752] Migration guide for unexpected shuffle RESTful api change
since 0.5.0
+- [CELEBORN-1745] Remove application top disk usage code
+- [CELEBORN-1719] Introduce celeborn.client.spark.stageRerun.enabled with
alternative celeborn.client.spark.fetch.throwsFetchFailure to enable spark
stage rerun
+- [CELEBORN-1687] Highlight flink session cluster issue in doc
+- [CELEBORN-1684] Fix ambiguous client jar expression of document
+- [CELEBORN-1678] Add Celeborn CLI User guide in README
+- [CELEBORN-1635] Introduce Blaze support document
+ [CLEBORN-1555] Replace deprecated config celeborn.storage.activeTypes in
docs and tests
+- [CELEBORN-1566] Update docs about using HDFS
+- [CELEBORN-1551] Fix wrong link in quota_management.md
+- [CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
+- [CELEBORN-1436][DOC] Move Rest API out from monitoring.md to webapi.md
+- [CELEBORN-1486] Introduce ClickHouse Backend in Gluten Support document
+- [CELEBORN-1466] Add local command in celeborn_ratis_shell.md
+
+### Dependencies
+- [CELEBORN-2030] Bump Spark from 3.5.5 to 3.5.6
+- [CELEBORN-2013] Upgrade scala binary version of spark-3.3, spark-3.4,
spark-3.5 profile to 2.13.8
+- [CELEBORN-1895] Bump log4j2 version to 2.24.3
+- [CELEBORN-1890] Bump Spark from 3.5.4 to 3.5.5
+- [CELEBORN-1884] Bump rocksdbjni version from 9.5.2 to 9.10.0
+- [CELEBORN-1877] Bump zstd-jni version from 1.5.2-1 to 1.5.7-1
+- [CELEBORN-1872] Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1
+- [CELEBORN-1864] Bump Netty version from 4.1.115.Final to 4.1.118.Final
+- [CELEBORN-1862] Bump Ratis version from 3.1.2 to 3.1.3
+- [CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
+- [CELEBORN-1806] Bump Spark from 3.5.3 to 3.5.4
+- [CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
+- [CELEBORN-1748] Deprecate identity provider configs tied with quota
+- [CELEBORN-1702] Bump Ratis version from 3.1.1 to 3.1.2
+- [CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
+- [CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
+- [CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0
+- [CELEBORN-1672] Bump Spark from 3.4.3 to 3.4.4
+- [CELEBORN-1666] Bump scala-protoc from 1.0.6 to 1.0.7
+- [CELEBORN-1525] Bump Ratis version from 3.1.0 to 3.1.1
+- [CELEBORN-1613] Bump Spark from 3.5.2 to 3.5.3
+- [CELEBORN-1605] Bump commons-lang3 version from 3.13.0 to 3.17.0
+- [CELEBORN-1604] Bump rocksdbjni version from 8.11.3 to 9.5.2
+- [CELEBORN-1562] Bump Spark from 3.5.1 to 3.5.2
+- [CELEBORN-1512] Bump Flink from 1.19.0 to 1.19.1
+- [CELEBORN-1499] Bump Ratis version from 3.0.1 to 3.1.0
+
+## Credits
+
+Thanks to the following contributors who helped to review and commit to Apache
Celeborn 0.6.0 version:
+
+| Contributors | | | |
| |
+|-----------------|-----------------|-----------------|-----------------|-----------------|----------------------|
+| Aidar Bariev | Amandeep Singh | Aravind Patnam | Arsen Gumin | A
Vishnusankar | Biao Geng |
+| Binjie Yang | Björn Boschman | Bowen Liang | Cheng Pan |
Chongchen Chen | Erik Fang |
+| Fei Wang | Fu Chen | Guangwei Hong | Haotian Cao | He
Zhao | Jiaming Xie |
+| Jianfu Li | Jiashu Xiong | Jinqian Fan | Kerwin Zhang |
Keyong Zhou | Kun Wan |
+| Leo Li | Lianne Li | Madhukar | Minchu Yang |
Mingxiao Feng | Mridul Muralidharan |
+| Nan Zhu | Nicholas Jiang | Nicolas Fraison | Pengqi Li |
Sanskar Modi | Saurabh Dubey |
+| Shaoyun Chen | Shengjie Wang | Shlomi Uubul | Veli Yang |
Weijie Guo | Xianming Lei |
+| Xinyu Wang | Xu Huang | Yajun Gao | Yanze Jiang | Yi
Chen | Yi Zhu |
+| Yuting Wang | Yuxin Tan | Zhao Zhao | Zhaohui Xu |
Zhengqi Zhang | Zhentao Shuai |
+| Ziyi Wu | | | |
| |
diff --git a/docs/download.md b/docs/download.md
index 9b5d4b30..1d515c48 100644
--- a/docs/download.md
+++ b/docs/download.md
@@ -23,6 +23,17 @@ license: |
The latest version is {{ stable_version }}.
+
+### 0.6.0 (2025-07-05)
+
+[release note](community/release_notes/release_note_0.6.0.md)
+
+| | Download from ASF
|
Checksum |
Signature |
+|:-----------:|:--------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------:|
+| Source Code |
[src](https://downloads.apache.org/celeborn/celeborn-0.6.0/apache-celeborn-0.6.0-source.tgz)
|
[sha512](https://downloads.apache.org/celeborn/celeborn-0.6.0/apache-celeborn-0.6.0-source.tgz.sha512)
|
[asc](https://downloads.apache.org/celeborn/celeborn-0.6.0/apache-celeborn-0.6.0-source.tgz.asc)
|
+| Binary |
[bin](https://downloads.apache.org/celeborn/celeborn-0.6.0/apache-celeborn-0.6.0-bin.tgz)
|
[sha512](https://downloads.apache.org/celeborn/celeborn-0.6.0/apache-celeborn-0.6.0-bin.tgz.sha512)
|
[asc](https://downloads.apache.org/celeborn/celeborn-0.6.0/apache-celeborn-0.6.0-bin.tgz.asc)
|
+
+
### 0.5.4 (2025-03-13)
[release note](community/release_notes/release_note_0.5.4.md)
@@ -56,6 +67,7 @@ The latest version is {{ stable_version }}.
All celeborn releases are available via
[https://archive.apache.org/dist/celeborn/](https://archive.apache.org/dist/celeborn/)
including checksums and
signatures. At the time of writing, this includes the following versions:
+* Apache Celeborn 0.6.0 (2025-07-05)
([Source](https://archive.apache.org/dist/celeborn/celeborn-0.6.0/apache-celeborn-0.6.0-source.tgz),
[Binaries](https://archive.apache.org/dist/celeborn/celeborn-0.6.0/))
* Apache Celeborn 0.5.4 (2025-03-13)
([Source](https://archive.apache.org/dist/celeborn/celeborn-0.5.4/apache-celeborn-0.5.4-source.tgz),
[Binaries](https://archive.apache.org/dist/celeborn/celeborn-0.5.4/))
* Apache Celeborn 0.5.3 (2025-01-06)
([Source](https://archive.apache.org/dist/celeborn/celeborn-0.5.3/apache-celeborn-0.5.3-source.tgz),
[Binaries](https://archive.apache.org/dist/celeborn/celeborn-0.5.3/))
* Apache Celeborn 0.5.2 (2024-11-26)
([Source](https://archive.apache.org/dist/celeborn/celeborn-0.5.2/apache-celeborn-0.5.2-source.tgz),
[Binaries](https://archive.apache.org/dist/celeborn/celeborn-0.5.2/))
diff --git a/mkdocs.yml b/mkdocs.yml
index 4b21b701..554c215c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -50,8 +50,8 @@ markdown_extensions:
- pymdownx.superfences
extra:
- version: 0.6.0-SNAPSHOT
- stable_version: 0.5.4
+ version: 0.7.0-SNAPSHOT
+ stable_version: 0.6.0
social:
- icon: fontawesome/brands/github