ZTE-EBASE opened a new pull request, #1226: URL: https://github.com/apache/cloudberry/pull/1226
… Data Ingestion gpfdist is a file distribution program in Cloudberry that can parallel load external data into the database. However, it has the drawback that data files must reside on the same machine as the tool. Therefore,extending it to support the SFTP protocol can address the above drawback and enable loading files from a remote server. <!-- Thank you for your contribution to Apache Cloudberry (Incubating)! --> Fixes #ISSUE_Number ### What does this PR do? By extending the `gpfdist` tool to support the SFTP protocol, remote data loading has been achieved, overcoming the challenge of having the tool and data files on the same machine. ### Type of Change New feature (non-breaking change) ### Test Plan <!-- How did you test these changes? --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Passed `make installcheck` - [ ] Passed `make -C src/test installcheck-cbdb-parallel` ### Impact <!-- Remove sections that don't apply --> **Performance:** <!-- Any performance implications? --> **User-facing changes:** <!-- Any changes visible to users? --> **Dependencies:** The ssh2 library needs to be introduced during compilation and placed under `/usr/local`. ### Checklist - [ ] Followed [contribution guide](https://cloudberry.apache.org/contribute/code) - [ ] Added/updated documentation - [ ] Reviewed code for security implications - [ ] Requested review from [cloudberry committers](https://github.com/orgs/apache/teams/cloudberry-committers) ### Additional Context Under this approach, the location template for the external table is: ```shell CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://ip:port/<sftp://sftp-user:passwd@sftp-hostip:sftp-port/file.csv>') format 'csv' (DELIMITER '|'); ``` Related Test Case: 1 Start gpfdist ```shell [cdbberry@node196 ~]$ gpfdist -d /home/cdbberry/ -p 9876 -l gpfdist.log & [1] 83161 [cdbberry@node196 ~]$ 2025-07-12 14:49:21 83161 INFO Before opening listening sockets - following listening sockets are available: 2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876 2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876 2025-07-12 14:49:21 83161 INFO Trying to open listening socket: 2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876 2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded 2025-07-12 14:49:21 83161 INFO Trying to open listening socket: 2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876 2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded Serving HTTP on port 9876, directory /home/cdbberry ``` 2 create table (external) ```shell CREATE table test( id int, name varchar(20) ); CREATE external table testww( id int, name varchar(20) ) location ('gpfdist://10.229.89.196:9876/<sftp://xxx:xxxx@xxx:22/xx.csv>') format 'csv' (delimiter as '|' NULL as '' FILL MISSING FIELDS) SEGMENT REJECT LIMIT 2 ROWS; ``` 3 data load ```shell insert into test select * from testww; ``` 4 result ```shell postgres=# insert into test select * from test_ext; INSERT 0 10 postgres=# select * from test; id | name ----+----------- 2 | ZTE-EBASE 3 | ZTE-EBASE 4 | ZTE-EBASE 6 | ZTE-EBASE 7 | ZTE-EBASE 8 | ZTE-EBASE 9 | ZTE-EBASE 10 | ZTE-EBASE 1 | ZTE-EBASE 5 | ZTE-EBASE (10 rows) ``` cat test.csv 1|ZTE-EBASE 2|ZTE-EBASE 3|ZTE-EBASE 4|ZTE-EBASE 5|ZTE-EBASE 6|ZTE-EBASE 7|ZTE-EBASE 8|ZTE-EBASE 9|ZTE-EBASE 10|ZTE-EBASE The amount and content of the table data are consistent with the file. ### CI Skip Instructions <!-- To skip CI builds, add the appropriate CI skip identifier to your PR title. The identifier must: - Be in square brackets [] - Include the word "ci" and either "skip" or "no" - Only use for documentation-only changes or when absolutely necessary --> --- <!-- Join our community: - Mailing list: [[email protected]](https://lists.apache.org/[email protected]) (subscribe: [email protected]) - Discussions: https://github.com/apache/cloudberry/discussions --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
