h499154897-cmyk opened a new issue, #10192: URL: https://github.com/apache/seatunnel/issues/10192
### Search before asking

- [x] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

SeaTunnel currently focuses on structured and semi-structured data integration (e.g., reading text/CSV/JSON files from SFTP and writing their content to S3/Ceph). It lacks support for file-level direct passthrough (whole-file transmission) between different file systems and storage protocols. The key limitations are:

1. Binary file incompatibility: SeaTunnel parses files as text/structured data by default, which corrupts or garbles binary files (e.g., ZIP archives, images, videos, executables).
2. Loss of original file attributes: the original file name, modification time, access permissions, file size, and other metadata cannot be retained during transmission.
3. No whole-file transmission: the current pipeline processes data line by line or in batches rather than transmitting the entire file as a single unit, which is inefficient for large files.
4. Limited support for file system protocols: for storage such as Ceph (CephFS/RGW), SeaTunnel relies on S3-compatible sinks and cannot interact directly with CephFS or other file system protocols for passthrough.

#### Expected Feature (File System Direct Passthrough)

We propose adding a File System Passthrough feature to SeaTunnel that enables direct, whole-file transmission between different storage protocols without parsing or modifying the file content. The core capabilities should include:

1. Support for multiple storage protocols:
   - [ ] Source: SFTP, FTP, Local File System, HDFS, S3 (including Ceph RGW), CephFS, etc.
   - [ ] Sink: Ceph (RGW/CephFS), S3, Local File System, HDFS, SFTP, OSS, COS, etc.
2. Whole-file transmission: transmit the entire file as a single unit (no line-by-line parsing) so that binary files and large files are handled efficiently.
3. Preserve file attributes:
   - [ ] Retain original file names (critical for business scenarios).
   - [ ] Preserve metadata (modification time, access time, file permissions, file size, etc.).
   - [ ] Support custom file name mapping (e.g., adding prefixes/suffixes, renaming rules) if needed.
4. Batch and incremental transmission:
   - [ ] Support batch transmission of all files in a specified directory (including subdirectories).
   - [ ] Support incremental transmission (e.g., only transmit files that are new or modified since the last sync).
5. Filter and control capabilities:
   - [ ] Support file filtering via wildcards (e.g., `*.log`, `data_*.zip`) or regular expressions.
   - [ ] Support skipping empty files, hidden files, or files larger/smaller than a specified size.
   - [ ] Support configurable overwrite policies (e.g., overwrite existing files, skip, or append).
6. Seamless integration with existing SeaTunnel pipelines:
   - [ ] Provide a dedicated FilePassthrough Source/Sink plugin (or extend existing file connectors with a "passthrough mode").
   - [ ] Allow optional integration with Transform steps (e.g., adding file metadata as tags before transmission) for flexible customization.

#### Summary

Add a file system passthrough feature that supports whole-file transfer between protocols such as SFTP, local files, HDFS, Ceph (RGW/CephFS), and S3. The core capabilities are:

- Binary file transfer: file content is not parsed and is passed through directly.
- Preservation of metadata such as the original file name, modification time, and permissions.
- Batch directory synchronization, incremental transfer, and file filtering.
- Seamless integration with existing SeaTunnel pipelines, with optional processing of file metadata.

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
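
To make the requested behaviour more concrete, here is a minimal sketch of the passthrough semantics described in this issue: whole-file binary copy, metadata preservation, wildcard filtering, skipping empty and hidden files, and an overwrite policy that doubles as a simple incremental sync. It is not SeaTunnel code and does not reflect any existing connector API. It only covers the local file system using the Python standard library, and the function name `passthrough_sync` and its parameters are hypothetical, chosen for illustration.

```python
import fnmatch
import os
import shutil
from pathlib import Path


def passthrough_sync(src_root, dst_root, pattern="*", skip_empty=True,
                     skip_hidden=True, overwrite="newer"):
    """Copy matching files byte-for-byte from src_root to dst_root.

    overwrite is one of (hypothetical policy names):
      "always" - replace existing destination files,
      "skip"   - never replace an existing destination file,
      "newer"  - replace only if the source mtime is more recent.
    Returns the relative paths (POSIX style) of the files copied.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for dirpath, dirnames, filenames in os.walk(src_root):
        # Prune hidden directories in place and walk in a stable order.
        dirnames[:] = sorted(
            d for d in dirnames if not (skip_hidden and d.startswith("."))
        )
        for name in sorted(filenames):
            if skip_hidden and name.startswith("."):
                continue
            if not fnmatch.fnmatch(name, pattern):
                continue
            src = Path(dirpath) / name
            src_stat = src.stat()
            if skip_empty and src_stat.st_size == 0:
                continue
            rel = src.relative_to(src_root)
            dst = dst_root / rel
            if dst.exists():
                if overwrite == "skip":
                    continue
                if (overwrite == "newer"
                        and src_stat.st_mtime_ns <= dst.stat().st_mtime_ns):
                    continue
            dst.parent.mkdir(parents=True, exist_ok=True)
            # copy2 copies the raw bytes without parsing them and then
            # copies permission bits and access/modification times.
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied
```

Calling `passthrough_sync("/data/in", "/data/out", pattern="*.zip")` twice in a row would copy the matching files on the first call and nothing on the second, because the preserved modification time lets the "newer" policy detect unchanged files. A real connector would need an equivalent of this for each remote protocol (SFTP, HDFS, S3, CephFS), where metadata support differs from a local file system.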
