Hi Hadoop community,

I would like to start a discussion about adding Baidu Cloud BOS (Baidu Object Storage) as a native Hadoop-compatible filesystem connector.

JIRA: https://issues.apache.org/jira/browse/HDFS-11161
PR: https://github.com/apache/hadoop/pull/8347
CI Status: +1 overall, all checks passed.

I have had some offline discussions with LuciferYang and the contributors working on this connector. Based on those discussions, I am helping bring this proposal to the Hadoop community for broader review and feedback.

The goal is to integrate BOS support as a native Hadoop filesystem module, similar to the existing hadoop-aws (S3A), hadoop-aliyun, and hadoop-cos connectors.

1. Background

Baidu Cloud is one of the major cloud service providers in China. BOS (Baidu Object Storage) is Baidu's core object storage service and is widely used for big data analytics, machine learning, and data lake workloads.

A native Hadoop connector would allow Hadoop ecosystem projects, including MapReduce, Spark, Hive, Flink, and others, to access BOS storage directly through the bos:// scheme.

According to the contributors, this connector has been running in production at Baidu for around 8 years, serving both BOS users and Baidu MapReduce (BMR) workloads.

2. Implementation

The proposed module is placed under:

    hadoop-cloud-storage-project/hadoop-bos

This follows the structure of the existing cloud storage connectors.

The implementation includes:

- A full Hadoop FileSystem implementation with the bos:// URI scheme
- Pluggable credentials provider support
- Contract tests covering standard filesystem operations
- Dependency shading or exclusion to avoid classpath conflicts, with shaded dependencies placed under org.apache.hadoop.fs.bos.shaded.*

3. Long-term Maintenance

The following contributors have expressed commitment to maintaining this module:

- yangdong2398, BOS R&D
- LuciferYang, Apache Spark PMC
- jackylee-ch, Apache Gluten PMC
- houzhizhen, Apache HugeGraph committer
- summaryzb, Apache Uniffle committer

They have committed to:

- Responding to issues and PRs within one week
- Keeping dependencies up to date
- Adapting the connector to future Hadoop API changes

4. Why Consider Integrating This into Hadoop

This proposal follows a similar rationale to hadoop-aws (S3A), hadoop-aliyun, and hadoop-cos:

- Users can rely on a single, consistent Hadoop distribution without managing separate connector JARs and version compatibility manually
- A connector maintained within the Hadoop community is easier for users to trust and review
- Shared CI helps ensure ongoing compatibility with Hadoop trunk

I would like to invite feedback from the community on whether this connector is appropriate to include in Hadoop, and what additional work, review, or requirements would be needed before it can be accepted.

The contributors are copied on this thread and are expected to participate in the discussion. They can provide more details about the implementation, production usage, and maintenance plan.

Best regards,
Shilun Fan
